CVE-2021-47209

Name	CVE-2021-47209
Description	In the Linux kernel, the following vulnerability has been resolved: sched/fair: Prevent dead task groups from regaining cfs_rq's Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) \|\| cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : ---truncated---
Source	CVE (at NVD; CERT, LWN, oss-sec, fulldisc, Red Hat, Ubuntu, Gentoo, SUSE bugzilla/CVE, GitHub advisories/code/issues, web search, more)

Vulnerable and fixed packages

The table below lists information on source packages.

Source Package	Release	Version	Status
linux (PTS)	jessie, jessie (lts)	3.16.84-1	vulnerable
	stretch (security)	4.9.320-2	vulnerable
	stretch (lts), stretch	4.9.320-3	vulnerable
	buster (security), buster, buster (lts)	4.19.316-1	fixed
	bullseye	5.10.223-1	fixed
	bullseye (security)	5.10.226-1	fixed
	bookworm	6.1.115-1	fixed
	bookworm (security)	6.1.119-1	fixed
	trixie	6.12.5-1	fixed
	sid	6.12.6-1	fixed

The information below is based on the following data on fixed versions.

Package	Type	Release	Fixed Version	Urgency
linux	source	jessie	(unfixed)	end-of-life
linux	source	stretch	(unfixed)	end-of-life
linux	source	buster	(not affected)
linux	source	bullseye	(not affected)
linux	source	(unstable)	5.15.5-1

Notes

[bullseye] - linux <not-affected> (Vulnerable code not present)
[buster] - linux <not-affected> (Vulnerable code not present)
https://git.kernel.org/linus/b027789e5e50494c2325cc70c8642e7fd6059479 (5.16-rc1)