From: Dave Taht
Date: Mon, 6 May 2024 17:27:50 -0700
To: libreqos, bloat
Subject: [Bloat] Fwd: [PATCHSET v6] sched: Implement BPF extensible scheduler class

---------- Forwarded message ---------
From: Tejun Heo
Date: Wed, May 1, 2024, 8:13 AM
Subject: [PATCHSET v6] sched: Implement BPF extensible scheduler class
To: peterz@infradead.org, mgorman@suse.de, ast@kernel.org, martin.lau@kernel.org, pjt@google.com, dvernet@meta.com, riel@surriel.com, memxor@gmail.com
Cc: kernel-team@meta.com

Updates and Changes
-------------------

This is v6 of the sched_ext (SCX) patchset. During the past five months, both the development and adoption of sched_ext have been progressing briskly. Here are some highlights around adoption:

- Valve has been working with Igalia to implement a sched_ext scheduler for Steam Deck.
The development is still in its early stages, but they are already happy with the results (more consistent FPS) and are planning to enable the scheduler on Steam Deck.

  https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_lavd
  https://ossna2024.sched.com/event/1aBOT/optimizing-scheduler-for-linux-gaming-changwoo-min-igalia

- Ubuntu is considering including sched_ext in the upcoming 24.10 release. Andrea Righi of Canonical has been actively working on a userspace scheduling framework since the end of last year.

  https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rustland
  https://discourse.ubuntu.com/t/introducing-kernel-6-8-for-the-24-04-noble-numbat-release/41958

- We (Meta) are now deploying a sched_ext scheduler on one large production workload (web), conducting wide-scale verification benchmarks on another (ads), and preparing production deployment on yet another workload (ML training). These all use scx_layered, which is useful for quick prototyping and experiments. In the process, we've identified several common strategies that are useful across multiple workloads (e.g. soft-affinitizing related threads) and are in the process of implementing something more generic and polished.

  https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_layered

- Because Google's ghOSt framework (a userspace scheduling framework with BPF hooks for optimization) is already available in the Google fleet, that's what Google is currently experimenting with. They are seeing promising results in a couple of important workloads (search and cloud hosting) and are trying to move on to deployment. The gap between ghOSt and sched_ext is not wide at this point, and Google is working to port ghOSt schedulers on top of sched_ext.

- ChromeOS is looking into scx_layered with a focus on reducing latency (as a replacement for RT), with a prototype port of sched_ext on ChromeOS.

- Oculus is facing a number of scheduling-related challenges and is looking into sched_ext.
They have an Android port that they're experimenting with, although any actual deployment would have to wait until a newer platform kernel can be rolled out, which will take quite a while.

Although sched_ext is still out of tree, we're seeing wide interest and adoption across multiple organizations and different use cases. Plus, our first-hand experiences at Meta and the reports from other users definitively confirm the hypothesized merits of sched_ext - among others, a lowered barrier of entry coupled with rapid and safe experiments leading to insights and performance gains in an easily deployable form.

For example, scx_rusty and scx_layered are proving the substantial benefits of better work conservation within L3 domains, soft affinity (flexibly grouping related threads together) and application-specific prioritization for request-driven server workloads. scx_lavd is demonstrating the benefits of a new set of heuristics, based on the runtime and the frequencies of waking up others and being woken up, for gaming and possibly other interactive workloads.

We're seeing constantly increasing interest both within Meta and from the wider community. The benefits are inherent and clear enough that I don't see a reason for the trend to change. However, being out of tree does add a lot of overhead in terms of decision making and logistics for everyone involved. Given that there already is substantial adoption which continues to grow, and that sched_ext doesn't affect the built-in schedulers or the rest of the kernel in an invasive manner, I believe it's reasonable to consider sched_ext for inclusion. David Vernet and I would be happy to run the tree, respond to bug reports, and coordinate with scheduler core or any other kernel subsystem that sched_ext may interact with.

If you're interested in the high-level arguments for and against, please refer to the following discussion in the v4 posting:

  http://lkml.kernel.org/r/20230711011412.100319-1-tj@kernel.org

If you're interested in getting your hands dirty, the following repository contains example and practical schedulers along with documentation on how to get started:

  https://github.com/sched-ext/scx

The kernel and scheduler packages are available for Ubuntu, CachyOS and Arch (through the CachyOS repo). Fedora packaging is in the works.

There are also a Slack workspace and a weekly office hour:

  https://schedextworkspace.slack.com

  Office hour: Mondays at 16:00 UTC (8:00 AM PST, 17:00 CEST, 1:00 AM KST). Please see the #office-hours channel for the Zoom invite.

The following are significant changes from v5 (http://lkml.kernel.org/r/20231111024835.2164816-1-tj@kernel.org). For a more detailed list of changes, please refer to the patch descriptions.

- scx_pair, scx_userland, scx_next and the Rust schedulers are removed from the kernel tree and are now hosted in https://github.com/sched-ext/scx along with all the other schedulers.

- The SCX_OPS_DISABLING state is replaced with the new bypass mechanism, which allows temporarily putting the system into simple FIFO scheduling mode. In addition to the shutdown path, this is used to isolate the BPF scheduler across power management events.

- ops.prep_enable() is replaced with ops.init_task(), and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext.

- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating a local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage.

- SCX_TASK_ENQ_LOCAL, which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ, was removed.
Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl(), which returns the verdict in a bool out param.

- Dispatch decisions made in ops.dispatch() may now be cancelled with a new scx_bpf_dispatch_cancel() kfunc.

- A new SCX_KICK_IDLE flag is available for use with scx_bpf_kick_cpu() to only send a resched IPI if the target CPU is idle.

- exit_code was added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and enables e.g. handling CPU hotplugs by restarting the scheduler.

- A debug dump was added. When the BPF scheduler gets aborted, the states of all runqueues and runnable tasks are captured and sent to the scheduler binary to aid debugging. See https://github.com/sched-ext/scx/issues/234 for an example of the debug dump being used to root-cause a bug in scx_lavd.

- The BPF scheduler can now iterate DSQs and consume specific tasks.

- CPU frequency scaling support was added through the cpuperf kfunc interface.

- The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. A drgn script, tools/sched_ext/scx_show_state.py, is added for additional visibility.

- tools/sched_ext/include/scx/compat[.bpf].h and other facilities to allow schedulers to be loaded on older kernels were added. The current tentative target is maintaining backward compatibility for at least one major kernel release where reasonable.

- Code was reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files.

Overview
--------

This patch set proposes a new scheduler class called 'ext_sched_class', or sched_ext, which allows scheduling policies to be implemented as BPF programs.
More details will be provided on the overall architecture of sched_ext throughout the various patches in this set, as well as in the "How" section below.

We realize that this patch set is a significant proposal, so we will be going into depth in the following "Motivation" section to explain why we think it's justified. That section is laid out as follows, touching on three main axes where we believe that sched_ext provides significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies.

2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments.

After the motivation section, we'll provide a more detailed (but still high-level) overview of how sched_ext works.

Motivation
----------

1. Ease of experimentation and exploration

*Why is exploration important?*

Scheduling is a challenging problem space. Small changes in scheduling behavior can have a significant impact on various components of a system, with the corresponding effects varying widely across different platforms, architectures, and workloads.

While complexities have always existed in scheduling, they have increased dramatically over the past 10-15 years. In the mid-late 2000s, cores were typically homogeneous and further apart from each other, with the criteria for scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU designs, where the total power budget of all CPU cores often far exceeds the power budget of the socket, with dynamic frequency scaling, and with or without chiplets, have significantly expanded the scheduling problem space.
Cache hierarchies have become less uniform, with Core Complex (CCX) designs such as recent AMD processors having multiple shared L3 caches within a single socket. Such topologies resemble NUMA sans persistent NUMA node stickiness.

Use cases have become increasingly complex and diverse as well. Applications such as mobile and VR have strict latency requirements to avoid missing deadlines that impact user experience. Stacking workloads in servers is constantly pushing the demands on the scheduler in terms of workload isolation and resource distribution.

Experimentation and exploration are important for any non-trivial problem space. However, given the recent hardware and software developments, we believe that experimentation and exploration are not just important, but _critical_ in the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has proposed an experimental patch set [0] which enables userspace to provide hints to the scheduler via "Userspace Hinting". The approach adds a prctl() API which allows callers to set a numerical "hint" value on a struct task_struct. This hint is then optionally read by the scheduler to adjust the cost calculus for various scheduling decisions.

[0]: https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/

Huawei have also expressed interest [1] in enabling some form of programmable scheduling. While we're unaware of any patch sets which have been sent to the upstream list for this proposal, it similarly illustrates the need for more flexibility in the scheduler.

[1]: https://lore.kernel.org/bpf/dedc7b72-9da4-91d0-d81d-75360c177188@huawei.com/

Additionally, Google has developed ghOSt [2] with the goal of enabling custom, userspace-driven scheduling policies. Prior presentations at LPC [3] have discussed ghOSt and how BPF can be used to accelerate scheduling.
[2]: https://dl.acm.org/doi/pdf/10.1145/3477132.3483542
[3]: https://lpc.events/event/16/contributions/1365/

*Why can't we just explore directly with CFS?*

Experimenting with CFS directly or implementing a new sched_class from scratch is of course possible, but is often difficult and time consuming. Newcomers to the scheduler often require years to understand the codebase and become productive contributors. Even for seasoned kernel engineers, experimenting with and upstreaming features can take a very long time. The iteration process itself is also time consuming, as testing scheduler changes on real hardware requires reinstalling the kernel and rebooting the host.

Core scheduling is an example of a feature that took a significant amount of time and effort to integrate into the kernel. Part of the difficulty with core scheduling was the inherent mismatch in abstraction between the desire to perform core-wide scheduling and the per-cpu design of the kernel scheduler. This caused issues, for example in ensuring proper fairness between the independent runqueues of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to academia as well. Master's/PhD candidates who are interested in improving the scheduler will spend years ramping up, only to complete their degrees just as they're finally ready to make significant changes. A lower barrier to entry would allow researchers to ramp up more quickly, test out hypotheses, and iterate on novel ideas. Research methodology is also severely hampered by the high barrier of entry to make modifications; for example, the Shenango [4] and Shinjuku scheduling policies used sched affinity to replicate the desired policy semantics, due to the difficulty of incorporating these policies into the kernel directly.

[4]: https://www.usenix.org/system/files/nsdi19-ousterhout.pdf

The iterative process itself also imposes a significant cost to working on the scheduler.
Testing changes requires developers to recompile and reinstall the kernel, reboot their machines, rewarm their workloads, and then finally rerun their benchmarks. Though some of this overhead could potentially be mitigated by enabling schedulers to be implemented as kernel modules, a machine crash or subtle system state corruption is always only one innocuous mistake away. These problems are exacerbated when testing production workloads in a datacenter environment, where multiple hosts may be involved in an experiment, requiring a significantly longer ramp-up time. Warming up memcache instances in the Meta production environment takes hours, for example.

*How does sched_ext help with exploration?*

sched_ext attempts to address all of the problems described above. In this section, we'll describe the benefits to experimentation and exploration that are afforded by sched_ext, provide real-world examples of those benefits, and discuss some of the trade-offs and considerations in our design choices.

One of our main goals was to lower the barrier to entry for experimenting with the scheduler. sched_ext provides ergonomic callbacks and helpers to ease common operations such as managing idle CPUs, scheduling tasks on arbitrary CPUs, handling preemptions from other scheduling classes, and more. While sched_ext does require some ramp-up, the complexity is self-contained and the learning curve gradual. Developers can ramp up by first implementing simple policies such as global weighted vtime scheduling in only tens of lines of code, and then continue to learn the APIs and building blocks available with sched_ext as they build more featureful and complex schedulers.

Another critical advantage provided by sched_ext is the use of BPF. BPF provides strong safety guarantees by statically analyzing programs at load time to ensure that they cannot corrupt or crash the system.
sched_ext guarantees system integrity no matter what BPF scheduler is loaded, and provides mechanisms to safely disable the current BPF scheduler and migrate tasks back to a trusted scheduler. For example, we also implement in-kernel safety mechanisms to guarantee that a misbehaving scheduler cannot indefinitely starve tasks. BPF also enables sched_ext to significantly improve iteration speed for running experiments. Loading and unloading a BPF scheduler is simply a matter of running and terminating a sched_ext binary.

BPF also provides programs with a rich set of APIs, such as maps, kfuncs, and BPF helpers. In addition to providing useful building blocks to programs that run entirely in kernel space (such as many of our example schedulers), these APIs also allow programs to leverage user space in making scheduling decisions. Specifically, the Atropos sample scheduler has a relatively simple weighted vtime or FIFO scheduling layer in BPF, paired with a load balancing component in userspace written in Rust. As described in more detail below, we also built a more general user-space scheduling framework called "rhone" by leveraging various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from the complexity in some of the example schedulers. scx_pair.bpf.c illustrates this point well. To start, it requires a good amount of code to emulate cgroup-local-storage. In the kernel proper, this would simply be a matter of adding another pointer to the struct cgroup, but in BPF, it requires a complex juggling of data amongst multiple different maps, a good amount of boilerplate code, and some unwieldy bpf_loop()'s and atomics. The code is also littered with explicit and often unnecessary sanity checks to appease the verifier.

That being said, BPF is being rapidly improved.
For example, Yonghong Song recently upstreamed a patch set [5] to add a cgroup local storage map type, allowing scx_pair.bpf.c to be simplified. There are plans to address other issues as well, such as providing statically-verified locking and avoiding the need for unnecessary sanity checks. Addressing these shortcomings is a high priority for BPF, and as progress continues to be made, we expect most deficiencies to be addressed in the not-too-distant future.

[5]: https://lore.kernel.org/bpf/20221026042835.672317-1-yhs@fb.com/

Yet another exploration advantage of sched_ext is that it helps widen the scope of experiments. For example, sched_ext makes it easy to defer CPU assignment until a task starts executing, allowing schedulers to share scheduling queues at any granularity (hyper-twin, CCX and so on). Additionally, higher-level frameworks can be built on top to further widen the scope. For example, the aforementioned "rhone" [6] library allows implementing scheduling policies in user-space by encapsulating the complexity around communicating scheduling decisions with the kernel. This allows taking advantage of a richer programming environment in user-space, enabling experimenting with, for instance, more complex mathematical models.

[6]: https://github.com/Decave/rhone

sched_ext also allows developers to leverage machine learning. At Meta, we experimented with using machine learning to predict whether a running task would soon yield its CPU. These predictions can be used to aid the scheduler in deciding whether to keep a runnable task on its current CPU rather than migrating it to an idle CPU, with the hope of avoiding unnecessary cache misses. Using a tiny neural net model with only one hidden layer of size 16, and a decaying count of 64 syscalls as a feature, we were able to achieve a 15% throughput improvement on an Nginx benchmark, with an 87% inference accuracy.

2. Customization

This section discusses how sched_ext can enable users to run workloads on application-specific schedulers.

*Why deploy custom schedulers rather than improving CFS?*

Implementing application-specific schedulers and improving CFS are not conflicting goals. Scheduling features explored with sched_ext which yield beneficial results, and which are sufficiently generalizable, can and should be integrated into CFS. However, CFS is fundamentally designed to be a general-purpose scheduler, and thus is not conducive to being extended with some highly targeted application- or hardware-specific changes.

Targeted, bespoke scheduling has many potential use cases. For example, VM scheduling can make certain optimizations that are infeasible in CFS due to the constrained problem space (scheduling a static number of long-running VCPUs versus an arbitrary number of threads). Additionally, certain applications might want to make targeted policy decisions based on hints directly from the application (for example, a service that knows the different deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling policies. One example is "central" scheduling, wherein a single CPU makes all scheduling decisions for the entire system. This allows most cores on the system to be fully dedicated to running workloads, and can have significant performance improvements for certain use cases. For example, central scheduling with VCPUs can avoid expensive vmexits and cache flushes by instead delegating the responsibility of preemption checks from the tick to a single CPU. See scx_central.bpf.c for a simple example of a central scheduling policy built in sched_ext.

Some workloads also have non-generalizable constraints which enable optimizations in a scheduling policy which would otherwise not be feasible. For example, VM workloads at Google typically have a low overcommit ratio compared to the number of physical CPUs.
This allows the scheduler to support bounded tail latencies, as well as longer blocks of uninterrupted time.

Yet another interesting use case is the scx_flatcg scheduler, which is in 0024-sched_ext-Add-cgroup-support.patch and provides a flattened hierarchical vtree for cgroups. This scheduler does not account for thundering herd problems among cgroups, and therefore may not be suitable for inclusion in CFS. However, in a simple benchmark using wrk [7] on apache serving a CGI script calculating the sha1sum of a small file, it outperformed CFS by ~3% with the CPU controller disabled and by ~10% with two apache instances competing with a 2:1 weight ratio nested four levels deep.

[7]: https://github.com/wg/wrk

Certain industries require specific scheduling behaviors that do not apply broadly. For example, ARINC 653 defines scheduling behavior that is widely used by avionic software, and some out-of-tree implementations (https://ieeexplore.ieee.org/document/7005306) have been built. While the upstream community may decide to merge one such implementation in the future, it would also be entirely reasonable not to do so given the narrowness of the use case and its non-generalizable, strict requirements. Such cases can be well served by sched_ext in all stages of the software development lifecycle -- development, testing, deployment and maintenance.

There are also classes of policy exploration, such as machine learning, or responding in real-time to application hints, that are significantly harder (and not necessarily appropriate) to integrate within the kernel itself.

*Won't this increase fragmentation?*

We acknowledge that to some degree, sched_ext does run the risk of increasing the fragmentation of scheduler implementations. As a result of exploration, however, we believe that enabling the larger ecosystem to innovate will ultimately accelerate the overall development and performance of Linux.
BPF programs are required to be GPLv2, which is enforced by the verifier on program loads. With regards to API stability, just as with other semi-internal interfaces such as BPF kfuncs, we won't be providing any API stability guarantees to BPF schedulers. While we intend to make an effort to provide compatibility when possible, we will not provide any explicit, strong guarantees as the kernel typically does with e.g. UAPI headers. For users who decide to keep their schedulers out-of-tree, the licensing and maintenance overheads will be fundamentally the same as for carrying out-of-tree patches.

With regards to the schedulers included in this patch set, and any other schedulers we implement in the future, both Meta and Google will open-source all of the schedulers we implement which have any relevance to the broader upstream community. We expect that some of these, such as the simple example schedulers and the scx_rusty scheduler, will be upstreamed as part of the kernel tree. Distros will be able to package and release these schedulers with the kernel, allowing users to utilize these schedulers out-of-the-box without requiring any additional work or dependencies such as clang or building the scheduler programs themselves. Other schedulers and scheduling frameworks such as rhone may be open-sourced through separate per-project repos.

3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large scale it can take months to roll a new kernel out to a fleet of servers. While this latency is expected and inevitable for normal kernel upgrades, it can become highly problematic when kernel changes are required to fix bugs. Livepatch [8] is available to quickly roll out critical security fixes to large fleets, but the scope of changes that can be applied with livepatching is fairly limited, and would likely not be usable for patching scheduling policies.
With sched_ext, new scheduling policies can be rapidly rolled out to production environments.

[8]: https://www.kernel.org/doc/html/latest/livepatch/livepatch.html

As an example, one of the variants of the L1 Terminal Fault (L1TF) [9] vulnerability allows a VCPU running a VM to read arbitrary host kernel memory for pages in the L1 data cache. The solution was to implement core scheduling, which ensures that tasks running as hypertwins have the same "cookie".

[9]: https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html

While core scheduling works well, it took a long time to finalize and land upstream. This long rollout period was painful, and required organizations to make difficult choices amongst a bad set of options. Some companies such as Google chose to implement and use their own custom L1TF-safe scheduler, others chose to run without hyper-threading enabled, and yet others left hyper-threading enabled and crossed their fingers.

Once core scheduling was upstream, organizations had to upgrade the kernels on their entire fleets. As downtime is not an option for many, these upgrades had to be gradually rolled out, which can take a very long time for large fleets.

An example of a sched_ext scheduler that illustrates core scheduling semantics is scx_pair.bpf.c, which co-schedules pairs of tasks from the same cgroup, and is resilient to L1TF vulnerabilities. While this example scheduler is certainly not suitable for production in its current form, a similar scheduler that is more performant and featureful could be written and deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll out new scheduling features without requiring kernel upgrades. At Google, for example, it was observed that some low-priority workloads were causing degraded performance for higher-priority workloads due to consuming a disproportionate share of memory bandwidth.
While a temporary mitigation was to use sched affinity to limit the footprint of this low-priority workload to a small subset of CPUs, a preferable solution would be to implement a more featureful task-priority mechanism which automatically throttles lower-priority tasks that are causing memory contention for the rest of the system. Implementing this in CFS and rolling it out to the fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or resource contention issue comes up that requires scheduler support to mitigate, sched_ext can be used to experiment with and test different policies. Once a scheduler is available, it can quickly be rolled out to as many hosts as necessary, and function as a stop-gap solution until a longer-term mitigation is upstreamed.

How
---

sched_ext is a new sched_class which allows scheduling policies to be implemented in BPF programs.

sched_ext leverages BPF's struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides us with the flexibility to change the APIs provided to schedulers as necessary. BPF struct_ops is also already being used successfully in other subsystems, such as in support of TCP congestion control.

The only struct_ops field that a scheduler is required to specify is the 'name' field. Otherwise, sched_ext will provide sane default behavior, such as automatically choosing an idle CPU on the task wakeup path if .select_cpu() is missing.
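To make the shape of the API concrete, the following is a sketch of roughly the smallest useful scheduler under this design: a global FIFO that leans on the defaults described above. It is illustrative only and not buildable on its own - the header name, the BPF_STRUCT_OPS convenience macro, and the SCX_SLICE_DFL slice constant are assumptions borrowed from the example-scheduler tooling, and exact signatures may differ between patchset versions:

```c
/* Illustrative sketch of a minimal sched_ext BPF scheduler (global FIFO).
 * Header name, macros and constants are assumptions from the example
 * tooling; this will not build without the sched_ext tree. */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL"; /* BPF schedulers must be GPLv2 */

/* Enqueue every runnable task on the shared global dispatch queue;
 * each CPU then pulls from SCX_DSQ_GLOBAL when it needs work. */
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.enqueue = (void *)simple_enqueue,
	/* 'name' is the only mandatory field; everything else falls back
	 * to sane defaults such as the built-in idle-CPU selection. */
	.name    = "simple",
};
```

Because .select_cpu() is omitted, task placement falls through to the default idle-CPU selection mentioned above, and the per-CPU local dsq's are fed from SCX_DSQ_GLOBAL by the default consumption path.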
*Dispatch queues*

To bridge the workflow imbalance between the scheduler core and the
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues
(dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local
per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and
need not be used by a scheduler that doesn't require it. As described in
more detail below, SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls
from when putting the next task on the CPU. The BPF scheduler can manage an
arbitrary number of dsq's using scx_bpf_create_dsq() and
scx_bpf_destroy_dsq().

*Scheduling cycle*

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, .select_cpu() is the first operation invoked.
   This serves two purposes: it allows a scheduler to optimize task
   placement by specifying a CPU where it expects the task to eventually be
   scheduled, and it ensures that the selected CPU will be woken if it's
   idle.

2. Once the target CPU is selected, .enqueue() is invoked. It can make one
   of the following decisions:

   - Immediately dispatch the task to either the global dsq (SCX_DSQ_GLOBAL)
     or the current CPU's local dsq (SCX_DSQ_LOCAL).

   - Immediately dispatch the task to a user-created dispatch queue.

   - Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
     scheduler, with the intention of dispatching it at a later time from
     .dispatch().

3. When a CPU is ready to schedule, it first looks at its local dsq. If
   empty, it invokes .consume() which should make one or more
   scx_bpf_consume() calls to consume tasks from dsq's. If a
   scx_bpf_consume() call succeeds, the CPU has the next task to run and
   .consume() can return. If .consume() is not defined, sched_ext will by
   default consume only from the built-in SCX_DSQ_GLOBAL dsq.

4.
If there's still no task to run, .dispatch() is invoked which should make
one or more scx_bpf_dispatch() calls to dispatch tasks from the BPF
scheduler to one of the dsq's. If more than one task has been dispatched,
go back to the previous consumption step.

*Verifying callback behavior*

sched_ext always verifies that any value returned from a callback is valid,
and will issue an error and unload the scheduler if it is not; for example,
if .select_cpu() returns an invalid CPU, or if an attempt is made to invoke
scx_bpf_dispatch() with invalid enqueue flags. Furthermore, if a task
remains runnable for too long without being scheduled, sched_ext will detect
it and error out the scheduler.

Closing Thoughts
----------------

Both Meta and Google have experimented quite a lot with schedulers in the
last several years. Google has benchmarked various workloads using userspace
scheduling and has achieved performance wins by trading off generality for
application-specific needs. At Meta, we are actively experimenting with
multiple production workloads and seeing significant performance gains, and
are in the process of deploying sched_ext schedulers on production workloads
at scale. We expect to leverage it extensively to run various experiments
and develop customized schedulers for a number of critical workloads.

In closing, both Meta and Google believe that sched_ext will significantly
evolve how the broader community explores the scheduling problem space,
while also enabling targeted policies for custom applications. We'll be able
to experiment more easily and quickly, explore uncharted areas, and deploy
emergency scheduler changes when necessary. The same applies to anyone who
wants to work on the scheduler, including academia and specialized
industries. sched_ext will push forward the state of the art when it comes
to scheduling and performance in Linux.
Written By
----------

David Vernet
Josh Don
Tejun Heo
Barret Rhoden

Supported By
------------

Paul Turner
Neel Natu
Patrick Bellasi
Hao Luo
Dimitrios Skarlatos

Patchset
--------

This patchset is on top of bpf/for-next as of 2024-04-29:

  07801a24e2f1 ("bpf, docs: Clarify PC use in instruction-set.rst")

and contains the following patches:

NOTE: The doc added by 0038 contains a high-level overview and might be a
good place to start.

0001-cgroup-Implement-cgroup_show_cftypes.patch
0002-sched-Restructure-sched_class-order-sanity-checks-in.patch
0003-sched-Allow-sched_cgroup_fork-to-fail-and-introduce-.patch
0004-sched-Add-sched_class-reweight_task.patch
0005-sched-Add-sched_class-switching_to-and-expose-check_.patch
0006-sched-Factor-out-cgroup-weight-conversion-functions.patch
0007-sched-Expose-css_tg-and-__setscheduler_prio.patch
0008-sched-Enumerate-CPU-cgroup-file-types.patch
0009-sched-Add-reason-to-sched_class-rq_-on-off-line.patch
0010-sched-Factor-out-update_other_load_avgs-from-__updat.patch
0011-cpufreq_schedutil-Refactor-sugov_cpu_is_busy.patch
0012-sched-Add-normal_policy.patch
0013-sched_ext-Add-boilerplate-for-extensible-scheduler-c.patch
0014-sched_ext-Implement-BPF-extensible-scheduler-class.patch
0015-sched_ext-Add-scx_simple-and-scx_example_qmap-exampl.patch
0016-sched_ext-Add-sysrq-S-which-disables-the-BPF-schedul.patch
0017-sched_ext-Implement-runnable-task-stall-watchdog.patch
0018-sched_ext-Allow-BPF-schedulers-to-disallow-specific-.patch
0019-sched_ext-Print-sched_ext-info-when-dumping-stack.patch
0020-sched_ext-Print-debug-dump-after-an-error-exit.patch
0021-tools-sched_ext-Add-scx_show_state.py.patch
0022-sched_ext-Implement-scx_bpf_kick_cpu-and-task-preemp.patch
0023-sched_ext-Add-a-central-scheduler-which-makes-all-sc.patch
0024-sched_ext-Make-watchdog-handle-ops.dispatch-looping-.patch
0025-sched_ext-Add-task-state-tracking-operations.patch
0026-sched_ext-Implement-tickless-support.patch
0027-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch
0028-sched_ext-Add-cgroup-support.patch
0029-sched_ext-Add-a-cgroup-scheduler-which-uses-flattene.patch
0030-sched_ext-Implement-SCX_KICK_WAIT.patch
0031-sched_ext-Implement-sched_ext_ops.cpu_acquire-releas.patch
0032-sched_ext-Implement-sched_ext_ops.cpu_online-offline.patch
0033-sched_ext-Bypass-BPF-scheduler-while-PM-events-are-i.patch
0034-sched_ext-Implement-core-sched-support.patch
0035-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch
0036-sched_ext-Implement-DSQ-iterator.patch
0037-sched_ext-Add-cpuperf-support.patch
0038-sched_ext-Documentation-scheduler-Document-extensibl.patch
0039-sched_ext-Add-selftests.patch

0001     : Cgroup prep.

0002-0012: Scheduler prep.

0013-0015: sched_ext core implementation and a couple of example BPF
           schedulers.

0016-0021: Utility features including safety mechanisms, switch-all and
           printing sched_ext state when dumping backtraces.

0022-0027: Kicking and preempting other CPUs, task state transition tracking
           and tickless support. Demonstrated with an example central
           scheduler which makes all scheduling decisions on one CPU.

0029-0030: cgroup support and the ability to wait for other CPUs after
           kicking them.

0031-0033: Add CPU preemption and hotplug and power-management support.

0034     : Add core-sched support.

0035-0036: Add DSQ rbtree and iterator support.

0037     : Add cpuperf (frequency scaling) support.

0038     : Add documentation.

0039     : Add selftests.

The patchset is also available in the following git branch:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git sched_ext-v6

diffstat follows.
 Documentation/scheduler/index.rst                                   |    1
 Documentation/scheduler/sched-ext.rst                               |  307 +
 MAINTAINERS                                                         |   13
 Makefile                                                            |    8
 drivers/tty/sysrq.c                                                 |    1
 include/asm-generic/vmlinux.lds.h                                   |    1
 include/linux/cgroup-defs.h                                         |    8
 include/linux/cgroup.h                                              |    5
 include/linux/sched.h                                               |    5
 include/linux/sched/ext.h                                           |  210
 include/linux/sched/task.h                                          |    3
 include/uapi/linux/sched.h                                          |    1
 init/Kconfig                                                        |    5
 init/init_task.c                                                    |   12
 kernel/Kconfig.preempt                                              |   24
 kernel/cgroup/cgroup.c                                              |   97
 kernel/fork.c                                                       |   17
 kernel/sched/build_policy.c                                         |    8
 kernel/sched/core.c                                                 |  324 +
 kernel/sched/cpufreq_schedutil.c                                    |   50
 kernel/sched/deadline.c                                             |    4
 kernel/sched/debug.c                                                |    3
 kernel/sched/ext.c                                                  | 6641 +++++++++++++++++++++++++++++++
 kernel/sched/ext.h                                                  |  139
 kernel/sched/fair.c                                                 |   25
 kernel/sched/idle.c                                                 |    2
 kernel/sched/rt.c                                                   |    4
 kernel/sched/sched.h                                                |  123
 kernel/sched/topology.c                                             |    4
 lib/dump_stack.c                                                    |    1
 tools/Makefile                                                      |   10
 tools/sched_ext/.gitignore                                          |    2
 tools/sched_ext/Makefile                                            |  246 +
 tools/sched_ext/README.md                                           |  270 +
 tools/sched_ext/include/bpf-compat/gnu/stubs.h                      |   11
 tools/sched_ext/include/scx/common.bpf.h                            |  301 +
 tools/sched_ext/include/scx/common.h                                |   71
 tools/sched_ext/include/scx/compat.bpf.h                            |  110
 tools/sched_ext/include/scx/compat.h                                |  197
 tools/sched_ext/include/scx/user_exit_info.h                        |  111
 tools/sched_ext/scx_central.bpf.c                                   |  361 +
 tools/sched_ext/scx_central.c                                       |  135
 tools/sched_ext/scx_flatcg.bpf.c                                    |  939 ++++
 tools/sched_ext/scx_flatcg.c                                        |  233 +
 tools/sched_ext/scx_flatcg.h                                        |   51
 tools/sched_ext/scx_qmap.bpf.c                                      |  673 +++
 tools/sched_ext/scx_qmap.c                                          |  150
 tools/sched_ext/scx_show_state.py                                   |   39
 tools/sched_ext/scx_simple.bpf.c                                    |  149
 tools/sched_ext/scx_simple.c                                        |  107
 tools/testing/selftests/sched_ext/.gitignore                        |    6
 tools/testing/selftests/sched_ext/Makefile                          |  216 +
 tools/testing/selftests/sched_ext/config                            |    9
 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c         |   42
 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c             |   57
 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c        |   39
 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c            |   56
 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c       |   21
 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c           |   60
 tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c        |   43
 tools/testing/selftests/sched_ext/enq_select_cpu_fails.c            |   61
 tools/testing/selftests/sched_ext/exit.bpf.c                        |   84
 tools/testing/selftests/sched_ext/exit.c                            |   55
 tools/testing/selftests/sched_ext/exit_test.h                       |   20
 tools/testing/selftests/sched_ext/hotplug.bpf.c                     |   55
 tools/testing/selftests/sched_ext/hotplug.c                         |  168
 tools/testing/selftests/sched_ext/hotplug_test.h                    |   15
 tools/testing/selftests/sched_ext/init_enable_count.bpf.c           |   53
 tools/testing/selftests/sched_ext/init_enable_count.c               |  166
 tools/testing/selftests/sched_ext/maximal.bpf.c                     |  164
 tools/testing/selftests/sched_ext/maximal.c                         |   51
 tools/testing/selftests/sched_ext/maybe_null.bpf.c                  |   26
 tools/testing/selftests/sched_ext/maybe_null.c                      |   40
 tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c             |   25
 tools/testing/selftests/sched_ext/minimal.bpf.c                     |   21
 tools/testing/selftests/sched_ext/minimal.c                         |   58
 tools/testing/selftests/sched_ext/prog_run.bpf.c                    |   32
 tools/testing/selftests/sched_ext/prog_run.c                        |   78
 tools/testing/selftests/sched_ext/reload_loop.c                     |   75
 tools/testing/selftests/sched_ext/runner.c                          |  201
 tools/testing/selftests/sched_ext/scx_test.h                        |  131
 tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c              |   40
 tools/testing/selftests/sched_ext/select_cpu_dfl.c                  |   72
 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c   |   89
 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c       |   72
 tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c         |   41
 tools/testing/selftests/sched_ext/select_cpu_dispatch.c             |   70
 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c |   37
 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c     |   56
 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c |   38
 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c     |   56
 tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c            |   92
 tools/testing/selftests/sched_ext/select_cpu_vtime.c                |   59
 tools/testing/selftests/sched_ext/test_example.c                    |   49
 tools/testing/selftests/sched_ext/util.c                            |   71
 tools/testing/selftests/sched_ext/util.h                            |   13
 96 files changed, 15056 insertions(+), 139 deletions(-)

Patchset History
----------------

v4 (http://lkml.kernel.org/r/20230711011412.100319-1-tj@kernel.org) -> v5:

- Updated to rebase on top of the current bpf/for-next (2023-11-06).
  '0002-0010: Scheduler prep' were simply rebased on top of the new EEVDF
  scheduler, which demonstrates a clean-cut API boundary between sched_ext
  and the sched core.

- To accommodate 32bit configs, fields which use atomic ops and
  store_release/load_acquire are switched from 64bits to longs.

- To help triaging, if sched_ext is enabled, backtrace dumps now show the
  currently active scheduler along with some debug information.

- Fixes for bugs including p->scx.flags corruption due to unsynchronized
  SCX_TASK_DSQ_ON_PRIQ changes, and overly permissive BTF struct and scx_bpf
  kfunc access checks.

- Other misc changes including renaming "type" to "kind" in scx_exit_info to
  ease usage from Rust and other languages in which "type" is a reserved
  keyword.

- scx_atropos is renamed to scx_rusty and received significant updates to
  improve scalability. Load metrics are now tracked in BPF and accessed only
  as necessary from userspace.

- Misc example scheduler improvements including the usage of a resizable BPF
  .bss array, the introduction of SCX_BUG[_ON](), and timer CPU pinning in
  scx_central.

- Improve Makefile and documentation for example schedulers.

v3 (https://lkml.kernel.org/r/20230317213333.2174969-1-tj@kernel.org) -> v4:

- There aren't any significant changes to the sched_ext API even though we
  kept experimenting heavily with a couple of BPF scheduler implementations,
  indicating that the core API has reached a level of maturity.
- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch, which
  implemented a custom guard scope for scheduler attribute changes, was
  dropped as upstream is moving towards a more generic implementation.

- Build fixes with different CONFIG combinations.

- Core code cleanups and improvements, including how the idle CPU is
  selected and disabling ttwu_queue for tasks on SCX to avoid confusing BPF
  schedulers expecting a ->select_cpu() call. See
  0012-sched_ext-Implement-BPF-extensible-scheduler-class.patch for more
  details.

- "_example" dropped from the example schedulers as the distinction between
  example-only and practically-useful isn't black-and-white. Instead, each
  scheduler has detailed comments and there's also a README file.

- scx_central, scx_pair and scx_flatcg are moved into their own patches as
  suggested by Josh Don.

- scx_atropos received substantial updates including fixes for bugs that
  could cause temporary stalls and improvements in load balancing and wakeup
  target CPU selection. For details, see
  0034-sched_ext-Add-a-rust-userspace-hybrid-example-schedu.patch.

v2 (http://lkml.kernel.org/r/20230128001639.3510083-1-tj@kernel.org) -> v3:

- ops.set_weight() added to allow BPF schedulers to track weight changes
  without polling p->scx.weight.

- scx_bpf_task_cgroup() kfunc added to allow the BPF scheduler to reliably
  determine the current cpu cgroup under rq lock protection. This required
  improving the kf_mask SCX operation verification mechanism and adding
  0023-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch.

- Updated to use the latest BPF improvements including KF_RCU and the inline
  iterator.

- scx_example_flatcg added to 0024-sched_ext-Add-cgroup-support.patch. It
  uses the new BPF rbtree support to implement a flattened cgroup hierarchy.

- A DSQ now also contains an rbtree so that it can be used to implement
  vtime based scheduling among tasks sharing a DSQ conveniently and
  efficiently.
  For more details, see
  0029-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch. All
  eligible example schedulers are updated to default to weighted vtime
  scheduling.

- The atropos scheduler's userspace code is substantially restructured and
  rewritten. The binary is renamed to scx_atropos and can auto-config the
  domains according to the cache topology.

- Various other example scheduler updates including scx_example_dummy being
  renamed to scx_example_simple, the example schedulers defaulting to
  enabling switch_all, and clarifying the performance expectation of each
  example scheduler.

- A bunch of fixes and improvements. Please refer to each patch for details.

v1 (http://lkml.kernel.org/r/20221130082313.3241517-1-tj@kernel.org) -> v2:

- Rebased on top of bpf/for-next - a5f6b9d577eb ("Merge branch 'Enable
  struct_ops programs to be sleepable'"). There were several missing
  features including generic cpumask helpers and sleepable struct_ops
  operation support that v1 was working around. The rebase gets rid of all
  SCX specific temporary helpers.

- Some kfunc helpers are context-sensitive and can only be called from
  specific operations. v1 didn't restrict kfunc accesses, allowing them to
  be misused, which can lead to crashes and other malfunctions. v2 makes
  more kfuncs safe to be called from anywhere and implements per-task mask
  based runtime access control for the rest. The longer-term plan is to make
  the BPF verifier enforce these restrictions. Combined with the above, sans
  mistakes and bugs, it shouldn't be possible to crash the machine through
  SCX and its helpers.

- Core-sched support. While v1 implemented the pick_task operation, there
  were multiple missing pieces for working core-sched support. v2 adds
  0027-sched_ext-Implement-core-sched-support.patch. SCX by default
  implements global FIFO ordering and allows the BPF schedulers to implement
  custom ordering via scx_ops.core_sched_before().
  scx_example_qmap is updated so that the five queues' relative priorities
  are correctly reflected when core-sched is enabled.

- Dropped balance_scx_on_up(), which was called from
  put_prev_task_balance(). UP support is now contained in SCX proper.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch adds
  SCHED_CHANGE_BLOCK() which encapsulates the preparation and restoration
  sequences used for task attribute changes. For SCX, this replaces
  sched_deq_and_put_task() and sched_enq_and_set_task() from v1.

- 0011-sched-Add-reason-to-sched_move_task.patch dropped from v1. SCX now
  distinguishes cgroup and autogroup tg's using task_group_is_autogroup().

- Other misc changes including fixes for bugs that Julia Lawall noticed and
  patch description updates with more details on how the introduced changes
  are going to be used.

- MAINTAINERS entries added.

The following are discussion points which were raised but didn't result in
code changes in this iteration.

- There were discussions around exposing __setscheduler_prio() and, in v2,
  SCHED_CHANGE_BLOCK() in kernel/sched/sched.h. Switching scheduler
  implementations is innate for SCX. At the very least, it needs to be able
  to turn the BPF scheduler on and off, which requires something equivalent
  to SCHED_CHANGE_BLOCK(). The use of __setscheduler_prio() depends on the
  behavior we want to present to userspace. The current choice of using CFS
  as the fallback when the BPF scheduler is not available seems more
  friendly and less error-prone than other options.

- Another discussion point was around for_each_active_class() and friends
  which skip over CFS or SCX when it's known that the sched_class must be
  empty. I left it as-is for now as it seems to be cleaner and more robust
  than trying to plug each operation, which may add unnecessary overhead.

Thanks.

--
tejun

---------- Forwarded message ---------
From: Tejun Heo <tj@kernel.org>
Date: Wed, May 1, 2024, 8:13 AM
Subject: [PATCHSET v6] sched: Implement BPF extensible scheduler class
To: <torvalds@linux-foundation.org>, <mingo@redhat.com>, <peterz@infradead.org>, <juri.lelli@redhat.com>, <vincent.guittot@linaro.org>, <dietmar.eggemann@arm.com>, <rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>, <bristot@redhat.com>, <vschneid@redhat.com>, <ast@kernel.org>, <daniel@iogearbox.net>, <andrii@kernel.org>, <martin.lau@kernel.org>, <joshdon@google.com>, <brho@google.com>, <pjt@google.com>, <derkling@google.com>, <haoluo@google.com>, <dvernet@meta.com>, <dschatzberg@meta.com>, <dskarlat@cs.cmu.edu>, <riel@surriel.com>, <changwoo@igalia.com>, <himadrics@inria.fr>, <memxor@gmail.com>, <andrea.righi@canonical.com>, <joel@joelfernandes.org>
Cc: <linux-kernel@vger.kernel.org>, <bpf@vger.kernel.org>, <kernel-team@meta.com>


Updates and Changes
-------------------

This is v6 of the sched_ext (SCX) patchset.

During the past five months, both the development and adoption of sched_ext
have been progressing briskly. Here are some highlights around adoption:
- Valve has been working with Igalia to implement a sched_ext scheduler for
  Steam Deck. The development is still in its early stages but they are
  already happy with the results (more consistent FPS) and are planning to
  enable the scheduler on Steam Deck.

    https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_lavd
    https://ossna2024.sched.com/event/1aBOT/optimizing-scheduler-for-linux-gaming-changwoo-min-igalia

- Ubuntu is considering including sched_ext in the upcoming 24.10 release.
  Andrea Righi of Canonical has been actively working on a userspace
  scheduling framework since the end of last year.

    https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rustland
    https://discourse.ubuntu.com/t/introducing-kernel-6-8-for-the-24-04-noble-numbat-release/41958

- We (Meta) are now deploying a sched_ext scheduler on one large production
  workload (web), conducting a wide-scale verification benchmark on another
  (ads), and preparing production deployment on yet another workload (ML
  training). These are all using scx_layered, which is useful for quick
  prototyping and experiments. In the process, we've identified several
  common strategies which are useful across multiple workloads (e.g.
  soft-affinitizing related threads) and are in the process of implementing
  something more generic and polished.

    https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_layered

- Because Google's ghOSt framework (a userspace scheduling framework with
  BPF hooks for optimization) is already available in the Google fleet,
  that's what Google is currently experimenting with. They are seeing
  promising results in a couple of important workloads (search and cloud
  hosting) and are trying to move on to deployment. The gap between ghOSt
  and sched_ext is not wide at this point and Google is working to port
  ghOSt schedulers on top of sched_ext.

- ChromeOS is looking into scx_layered with a focus on reducing latency (as
  a replacement for RT), with a prototype port of sched_ext on ChromeOS.

- Oculus is facing a number of scheduling related challenges and is looking
  into sched_ext. They have an Android port that they're experimenting with,
  although any actual deployment would have to wait until a newer platform
  kernel can be rolled out, which will take quite a while.

Although sched_ext is still out of tree, we're seeing wide interest and
adoption across multiple organizations and different use cases. Plus, our
first-hand experiences at Meta and the reports from other users definitively
confirm the hypothesized merits of sched_ext - among others, a lowered
barrier of entry coupled with rapid and safe experiments leading to insights
and performance gains in an easily deployable form.

For example, scx_rusty and scx_layered are proving the substantial benefits
of better work conservation within L3 domains, soft affinity (flexibly
grouping related threads together) and application-specific prioritization
for request-driven server workloads. scx_lavd is demonstrating the benefits
of a new set of heuristics, based on the runtime and the frequencies of
waking up others and being woken up, for gaming and possibly other
interactive workloads.

We're seeing constantly increasing interest both within Meta and from the
wider community. The benefits are inherent and clear enough that I don't see
a reason for the trend to change. However, being out of tree does add a lot
of overhead in terms of decision making and logistics for everyone involved.

Given that there already is substantial adoption which continues to grow,
and sched_ext doesn't affect the built-in schedulers or the rest of the
kernel in an invasive manner, I believe it's reasonable to consider
sched_ext for inclusion. David Vernet and I would be happy to run the tree,
respond to bug reports, and coordinate with the scheduler core or any other
kernel subsystem that sched_ext may interact with.


If you're interested in the high-level arguments for and against, please
refer to the following discussion in the v4 posting:

  http://lkml.kernel.org/r/20230711011412.100319-1-tj@kernel.org


If you're interested in getting your hands dirty, the following repository
contains example and practical schedulers along with documentation on how to
get started:

  https://github.com/sched-ext/scx

The kernel and scheduler packages are available for Ubuntu, CachyOS and Arch
(through the CachyOS repo). Fedora packaging is in the works.

There are also a Slack workspace and a weekly office hour:

  https://schedextworkspace.slack.com

  Office hour: Mondays at 16:00 UTC (8:00 AM PST, 17:00 CEST, 1:00 AM KST).
  Please see the #office-hours channel for the Zoom invite.


The following are significant changes from v5
(http://lkml.kernel.org/r/20231111024835.2164816-1-tj@kernel.org). For a
more detailed list of changes, please refer to the patch descriptions.

- scx_pair, scx_userland, scx_next and the Rust schedulers are removed from
  the kernel tree and now hosted in https://github.com/sched-ext/scx along
  with all other schedulers.

- SCX_OPS_DISABLING state is replaced with the new bypass mechanism which
  allows temporarily putting the system into simple FIFO scheduling mode. In
  addition to the shutdown path, this is used to isolate the BPF scheduler
  across power management events.

- ops.prep_enable() is replaced with ops.init_task() and
  ops.enable/disable() are now called whenever the task enters and leaves
  sched_ext instead of when the task becomes schedulable on sched_ext and
  stops being so. A new operation - ops.exit_task() - is called when the
  task stops being schedulable on sched_ext.

- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
  removes the need for communicating a local dispatch decision made by
  ops.select_cpu() to ops.enqueue() via per-task storage.

- SCX_TASK_ENQ_LOCAL, which told the BPF scheduler that scx_select_cpu_dfl()
  wants the task to be dispatched to the local DSQ, was removed. Instead,
  scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable
  idle CPU. If such behavior is not desired, users can use
  scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param.

- Dispatch decisions made in ops.dispatch() may now be cancelled with a new
  scx_bpf_dispatch_cancel() kfunc.

- A new SCX_KICK_IDLE flag is available for use with scx_bpf_kick_cpu() to
  only send a resched IPI if the target CPU is idle.

- exit_code added to scx_exit_info. This is used to indicate different exit
  conditions on non-error exits and enables e.g. handling CPU hotplugs by
  restarting the scheduler.

- Debug dump added. When the BPF scheduler gets aborted, the states of all
  runqueues and runnable tasks are captured and sent to the scheduler binary
  to aid debugging. See https://github.com/sched-ext/scx/issues/234 for an
  example of the debug dump being used to root-cause a bug in scx_lavd.

- The BPF scheduler can now iterate DSQs and consume specific tasks.

- CPU frequency scaling support added through cpuperf kfunc interface.

- The current state of sched_ext can now be monitored through files under
  /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable
  monitoring on kernels which don't enable debugfs. A drgn script,
  tools/sched_ext/scx_show_state.py, is added for additional visibility.

- tools/sched_ext/include/scx/compat[.bpf].h and other facilities to allow
  schedulers to be loaded on older kernels are added. The current tentative
  target is maintaining backward compatibility for at least one major kernel
  release where reasonable.

- Code reorganized so that only the parts necessary to integrate with the
  rest of the kernel are in the header files.


Overview
--------

This patch set proposes a new scheduler class called 'ext_sched_class', or
sched_ext, which allows scheduling policies to be implemented as BPF
programs.

More details will be provided on the overall architecture of sched_ext
throughout the various patches in this set, as well as in the "How" section
below. We realize that this patch set is a significant proposal, so we will
be going into depth in the following "Motivation" section to explain why we
think it's justified. That section is laid out as follows, touching on three
main axes where we believe that sched_ext provides significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

After the motivation section, we'll provide a more detailed (but still
high-level) overview of how sched_ext works.


Motivation
----------

1. Ease of experimentation and exploration

*Why is exploration important?*

Scheduling is a challenging problem space. Small changes in scheduling
behavior can have a significant impact on various components of a system,
with the corresponding effects varying widely across different platforms,
architectures, and workloads.

While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria
for scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or
without chiplets, have significantly expanded the scheduling problem space.
Cache hierarchies have become less uniform, with Core Complex (CCX) designs
such as recent AMD processors having multiple shared L3 caches within a
single socket. Such topologies resemble NUMA sans persistent NUMA node
stickiness.

Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience. Stacking workloads in servers is
constantly pushing the demands on the scheduler in terms of workload
isolation and resource distribution.

Experimentation and exploration are important for any non-trivial problem space. However, given the recent hardware and software developments, we
believe that experimentation and exploration are not just important, but _critical_ in the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental patch set [0] which enables userspace to provide
hints to the scheduler via "Userspace Hinting". The approach adds a prctl()
API which allows callers to set a numerical "hint" value on a struct
task_struct. This hint is then optionally read by the scheduler to adjust
the cost calculus for various scheduling decisions.

[0]: https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/

Huawei have also expressed interest [1] in enabling some form of
programmable scheduling. While we're unaware of any patch sets which have
been sent to the upstream list for this proposal, it similarly illustrates
the need for more flexibility in the scheduler.

[1]: https://lore.kernel.org/bpf/dedc7b72-9da4-91d0-d81d-75360c177188@huawei.com/

Additionally, Google has developed ghOSt [2] with the goal of enabling
custom, userspace driven scheduling policies. Prior presentations at LPC [3]
have discussed ghOSt and how BPF can be used to accelerate scheduling.

[2]: https://dl.acm.org/doi/pdf/10.1145/3477132.3483542
[3]: https://lpc.events/event/16/contributions/1365/

*Why can't we just explore directly with CFS?*

Experimenting with CFS directly or implementing a new sched_class from
scratch is of course possible, but is often difficult and time consuming.
Newcomers to the scheduler often require years to understand the codebase
and become productive contributors. Even for seasoned kernel engineers,
experimenting with and upstreaming features can take a very long time. The
iteration process itself is also time consuming, as testing scheduler
changes on real hardware requires reinstalling the kernel and rebooting the
host.

Core scheduling is an example of a feature that took a significant amount
of time and effort to integrate into the kernel. Part of the difficulty with
core scheduling was the inherent mismatch in abstraction between the desire
to perform core-wide scheduling, and the per-cpu design of the kernel
scheduler. This caused issues, for example, in ensuring proper fairness
between the independent runqueues of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master's/PhD candidates who are interested in improving
the scheduler will spend years ramping-up, only to complete their degrees
just as they're finally ready to make significant changes. A lower entrance
barrier would allow researchers to more quickly ramp up, test out
hypotheses, and iterate on novel ideas. Research methodology is also
severely hampered by the high barrier of entry to make modifications; for
example, the Shenango [4] and Shinjuku scheduling policies used sched
affinity to replicate the desired policy semantics, due to the difficulty
of incorporating these policies into the kernel directly.

[4]: https://www.usenix.org/system/files/nsdi19-ousterhout.pdf

The iterative process itself also imposes a significant cost to working on
the scheduler. Testing changes requires developers to recompile and
reinstall the kernel, reboot their machines, rewarm their workloads, and
then finally rerun their benchmarks. Though some of this overhead could
potentially be mitigated by enabling schedulers to be implemented as kernel
modules, a machine crash or subtle system state corruption is always only
one innocuous mistake away. These problems are exacerbated when testing
production workloads in a datacenter environment, where multiple hosts may
be involved in an experiment, requiring a significantly longer ramp-up time.
Warming up memcache instances in the Meta production environment takes
hours, for example.

*How does sched_ext help with exploration?*

sched_ext attempts to address all of the problems described above. In this
section, we'll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits,
and discuss some of the trade-offs and considerations in our design choices.

One of our main goals was to lower the barrier to entry for experimenting
with the scheduler. sched_ext provides ergonomic callbacks and helpers to
ease common operations such as managing idle CPUs, scheduling tasks on
arbitrary CPUs, handling preemptions from other scheduling classes, and
more. While sched_ext does require some ramp-up, the complexity is
self-contained, and the learning curve gradual. Developers can ramp up by
first implementing simple policies such as global weighted vtime scheduling
in only tens of lines of code, and then continue to learn the APIs and
building blocks available with sched_ext as they build more featureful and
complex schedulers.

Another critical advantage provided by sched_ext is the use of BPF. BPF
provides strong safety guarantees by statically analyzing programs at load
time to ensure that they cannot corrupt or crash the system. sched_ext
guarantees system integrity no matter what BPF scheduler is loaded, and
provides mechanisms to safely disable the current BPF scheduler and migrate
tasks back to a trusted scheduler. For example, we also implement in-kernel
safety mechanisms to guarantee that a misbehaving scheduler cannot
indefinitely starve tasks. BPF also enables sched_ext to significantly
improve iteration speed for running experiments. Loading and unloading a
BPF scheduler is simply a matter of running and terminating a sched_ext
binary.

BPF also provides programs with a rich set of APIs, such as maps, kfuncs,
and BPF helpers. In addition to providing useful building blocks to programs
that run entirely in kernel space (such as many of our example schedulers),
these APIs also allow programs to leverage user space in making scheduling
decisions. Specifically, the Atropos sample scheduler has a relatively
simple weighted vtime or FIFO scheduling layer in BPF, paired with a load
balancing component in userspace written in Rust. As described in more
detail below, we also built a more general user-space scheduling framework
called "rhone" by leveraging various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from
the complexity in some of the example schedulers. scx_pair.bpf.c illustrates
this point well. To start, it requires a good amount of code to emulate
cgroup-local-storage. In the kernel proper, this would simply be a matter of
adding another pointer to the struct cgroup, but in BPF, it requires a
complex juggling of data amongst multiple different maps, a good amount of
boilerplate code, and some unwieldy bpf_loop()'s and atomics. The code is
also littered with explicit and often unnecessary sanity checks to appease
the verifier.

That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a patch set [5] to add a cgroup local storage map type,
allowing scx_pair.bpf.c to be simplified. There are plans to address other
issues as well, such as providing statically-verified locking, and avoiding
the need for unnecessary sanity checks. Addressing these shortcomings is a
high priority for BPF, and as progress continues to be made, we expect most
deficiencies to be addressed in the not-too-distant future.

[5]: https://lore.kernel.org/bpf/20221026042835.672317-1-yhs@fb.com/

Yet another exploration advantage of sched_ext is that it widens the scope
of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling
queues at any granularity (hyper-twin, CCX and so on). Additionally, higher
level frameworks can be built on top to further widen the scope. For
example, the aforementioned "rhone" [6] library allows implementing
scheduling policies in user-space by encapsulating the complexity around
communicating scheduling decisions with the kernel. This allows taking
advantage of a richer programming environment in user-space, enabling
experimenting with, for instance, more complex mathematical models.

[6]: https://github.com/Decave/rhone

sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler
in deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a
15% throughput improvement on an Nginx benchmark, with an 87% inference
accuracy.

2. Customization

This section discusses how sched_ext can enable users to run workloads on application-specific schedulers.

*Why deploy custom schedulers rather than improving CFS?*

Implementing application-specific schedulers and improving CFS are not
conflicting goals. Scheduling features explored with sched_ext which yield
beneficial results, and which are sufficiently generalizable, can and should
be integrated into CFS. However, CFS is fundamentally designed to be a
general purpose scheduler, and thus is not conducive to being extended with
highly targeted application- or hardware-specific changes.

Targeted, bespoke scheduling has many potential use cases. For example, VM
scheduling can make certain optimizations that are infeasible in CFS due to
the constrained problem space (scheduling a static number of long-running
VCPUs versus an arbitrary number of threads). Additionally, certain
applications might want to make targeted policy decisions based on hints
directly from the application (for example, a service that knows the
different deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling
policies. One example is "central" scheduling, wherein a single CPU makes
all scheduling decisions for the entire system. This allows most cores on
the system to be fully dedicated to running workloads, and can have
significant performance improvements for certain use cases. For example,
central scheduling with VCPUs can avoid expensive vmexits and cache
flushes, by instead delegating the responsibility of preemption checks from
the tick to a single CPU. See scx_central.bpf.c for a simple example of a
central scheduling policy built in sched_ext.

Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example, VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to
support bounded tail latencies, as well as longer blocks of uninterrupted
time.

Yet another interesting use case is the scx_flatcg scheduler, which is in
0029-sched_ext-Add-a-cgroup-scheduler-which-uses-flattene.patch and
provides a flattened hierarchical vtree for cgroups. This scheduler does
not account for thundering herd problems among cgroups, and therefore may
not be suitable for inclusion in CFS. However, in a simple benchmark using
wrk [7] on apache serving a CGI script calculating the sha1sum of a small
file, it outperformed CFS by ~3% with the CPU controller disabled and by
~10% with two apache instances competing with a 2:1 weight ratio nested
four levels deep.

[7]: https://github.com/wg/wrk

Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely
used by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the
future, it would also be entirely reasonable not to do so given the
narrowness of the use case and its non-generalizable, strict requirements.
Such cases can be well served by sched_ext in all stages of the software
development lifecycle -- development, testing, deployment and maintenance.

There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.

*Won't this increase fragmentation?*

We acknowledge that to some degree, sched_ext does run the risk of
increasing the fragmentation of scheduler implementations. However, we
believe that the exploration enabled by allowing the larger ecosystem to
innovate will ultimately accelerate the overall development and performance
of Linux.

BPF programs are required to be GPLv2, which is enforced by the verifier at
program load. With regards to API stability, just as with other
semi-internal interfaces such as BPF kfuncs, we won't be providing any API
stability guarantees to BPF schedulers. While we intend to make an effort
to provide compatibility when possible, we will not provide any explicit,
strong guarantees as the kernel typically does with e.g. UAPI headers. For
users who decide to keep their schedulers out-of-tree, the licensing and
maintenance overheads will be fundamentally the same as for carrying
out-of-tree patches.

With regards to the schedulers included in this patch set, and any other
schedulers we implement in the future, both Meta and Google will
open-source all of the schedulers we implement which have any relevance to
the broader upstream community. We expect that some of these, such as the
simple example schedulers and the scx_rusty scheduler, will be upstreamed
as part of the kernel tree. Distros will be able to package and release
these schedulers with the kernel, allowing users to utilize these
schedulers out-of-the-box without requiring any additional work or
dependencies such as clang or building the scheduler programs themselves.
Other schedulers and scheduling frameworks such as rhone may be
open-sourced through separate per-project repos.

3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large
scale it can take months to roll a new kernel out to a fleet of servers.
While this latency is expected and inevitable for normal kernel upgrades,
it can become highly problematic when kernel changes are required to fix
bugs. Livepatch [8] is available to quickly roll out critical security
fixes to large fleets, but the scope of changes that can be applied with
livepatching is fairly limited, and would likely not be usable for patching
scheduling policies. With sched_ext, new scheduling policies can be rapidly
rolled out to production environments.

[8]: https://www.kernel.org/doc/html/latest/livepatch/livepatch.html

As an example, one of the variants of the L1 Terminal Fault (L1TF) [9]
vulnerability allows a VCPU running a VM to read arbitrary host kernel
memory for pages in the L1 data cache. The solution was to implement core
scheduling, which ensures that tasks running as hypertwins have the same
"cookie".

[9]: https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html

While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations
to make difficult choices amongst a bad set of options. Some companies such
as Google chose to implement and use their own custom L1TF-safe scheduler,
others chose to run without hyper-threading enabled, and yet others left
hyper-threading enabled and crossed their fingers.

Once core scheduling was upstream, organizations had to upgrade the kernels
on their entire fleets. As downtime is not an option for many, these
upgrades had to be gradually rolled out, which can take a very long time
for large fleets.

An example of a sched_ext scheduler that illustrates core scheduling
semantics is scx_pair.bpf.c, which co-schedules pairs of tasks from the
same cgroup, and is resilient to L1TF vulnerabilities. While this example
scheduler is certainly not suitable for production in its current form, a
similar scheduler that is more performant and featureful could be written
and deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll out
new scheduling features without requiring kernel upgrades. At Google, for
example, it was observed that some low-priority workloads were causing
degraded performance for higher-priority workloads due to consuming a
disproportionate share of memory bandwidth. While a temporary mitigation
was to use sched affinity to limit the footprint of this low-priority
workload to a small subset of CPUs, a preferable solution would be to
implement a more featureful task-priority mechanism which automatically
throttles lower-priority tasks which are causing memory contention for the
rest of the system. Implementing this in CFS and rolling it out to the
fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes up that requires scheduler support to
mitigate, sched_ext can be used to experiment with and test different
policies. Once a scheduler is available, it can quickly be rolled out to as
many hosts as necessary, and function as a stop-gap solution until a
longer-term mitigation is upstreamed.


How
---

sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.

sched_ext leverages BPF's struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class.
The role of sched_ext is to map the complex sched_class callbacks to the
simpler and more ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This
provides us with the flexibility to change the APIs provided to schedulers
as necessary. BPF struct_ops is also already being used successfully in
other subsystems, such as in support of TCP congestion control.

The only struct_ops field that is required to be specified by a scheduler
is the 'name' field. Otherwise, sched_ext will provide sane default
behavior, such as automatically choosing an idle CPU on the task wakeup
path if .select_cpu() is missing.

*Dispatch queues*

To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues
(dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one
local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it. As
described in more detail below, SCX_DSQ_LOCAL is the per-CPU FIFO that
sched_ext pulls from when putting the next task on the CPU. The BPF
scheduler can manage an arbitrary number of dsq's using
scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

*Scheduling cycle*

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, .select_cpu() is the first operation invoked.
   This serves two purposes. First, it allows a scheduler to optimize task
   placement by specifying a CPU where it expects the task to eventually be
   scheduled. Second, the selected CPU will be woken if it's idle.

2. Once the target CPU is selected, .enqueue() is invoked. It can make one
   of the following decisions:

   - Immediately dispatch the task to either the global dsq
     (SCX_DSQ_GLOBAL) or the current CPU's local dsq (SCX_DSQ_LOCAL).

   - Immediately dispatch the task to a user-created dispatch queue.

   - Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
     scheduler, with the intention of dispatching it at a later time from
     .dispatch().

3. When a CPU is ready to schedule, it first looks at its local dsq. If
   empty, it invokes .consume(), which should make one or more
   scx_bpf_consume() calls to consume tasks from dsq's. If a
   scx_bpf_consume() call succeeds, the CPU has the next task to run and
   .consume() can return. If .consume() is not defined, sched_ext will by
   default consume only from the built-in SCX_DSQ_GLOBAL dsq.

4. If there's still no task to run, .dispatch() is invoked, which should
   make one or more scx_bpf_dispatch() calls to dispatch tasks from the BPF
   scheduler to one of the dsq's. If more than one task has been
   dispatched, go back to the previous consumption step.

*Verifying callback behavior*

sched_ext always verifies that any value returned from a callback is valid,
and will issue an error and unload the scheduler if it is not: for example,
if .select_cpu() returns an invalid CPU, or if scx_bpf_dispatch() is
invoked with invalid enqueue flags. Furthermore, if a task remains runnable
for too long without being scheduled, sched_ext will detect it and
error-out the scheduler.


Closing Thoughts
----------------

Both Meta and Google have experimented quite a lot with schedulers in the
last several years. Google has benchmarked various workloads using user
space scheduling, and has achieved performance wins by trading off
generality for application specific needs. At Meta, we are actively
experimenting with multiple production workloads and seeing significant
performance gains, and are in the process of deploying sched_ext schedulers
on production workloads at scale. We expect to leverage it extensively to
run various experiments and develop customized schedulers for a number of
critical workloads.

In closing, both Meta and Google believe that sched_ext will significantly
evolve how the broader community explores the scheduling problem space,
while also enabling targeted policies for custom applications. We'll be
able to experiment easier and faster, explore uncharted areas, and deploy
emergency scheduler changes when necessary. The same applies to anyone who
wants to work on the scheduler, including academia and specialized
industries. sched_ext will push forward the state of the art when it comes
to scheduling and performance in Linux.


Written By
----------

David Vernet <dvernet@meta.com>
Josh Don <joshdon@google.com>
Tejun Heo <tj@kernel.org>
Barret Rhoden <brho@google.com>


Supported By
------------

Paul Turner <pjt@google.com>
Neel Natu <neelnatu@google.com>
Patrick Bellasi <derkling@google.com>
Hao Luo <haoluo@google.com>
Dimitrios Skarlatos <dskarlat@cs.cmu.edu>


Patchset
--------

This patchset is on top of bpf/for-next as of 2024-04-29:

  07801a24e2f1 ("bpf, docs: Clarify PC use in instruction-set.rst")

and contains the following patches:

NOTE: The doc added by 0038 contains a high-level overview and might be a
      good place to start.

0001-cgroup-Implement-cgroup_show_cftypes.patch
0002-sched-Restructure-sched_class-order-sanity-checks-in.patch
0003-sched-Allow-sched_cgroup_fork-to-fail-and-introduce-.patch
0004-sched-Add-sched_class-reweight_task.patch
0005-sched-Add-sched_class-switching_to-and-expose-check_.patch
0006-sched-Factor-out-cgroup-weight-conversion-functions.patch
0007-sched-Expose-css_tg-and-__setscheduler_prio.patch
0008-sched-Enumerate-CPU-cgroup-file-types.patch
0009-sched-Add-reason-to-sched_class-rq_-on-off-line.patch
0010-sched-Factor-out-update_other_load_avgs-from-__updat.patch
0011-cpufreq_schedutil-Refactor-sugov_cpu_is_busy.patch
0012-sched-Add-normal_policy.patch
0013-sched_ext-Add-boilerplate-for-extensible-scheduler-c.patch
0014-sched_ext-Implement-BPF-extensible-scheduler-class.patch
0015-sched_ext-Add-scx_simple-and-scx_example_qmap-exampl.patch
0016-sched_ext-Add-sysrq-S-which-disables-the-BPF-schedul.patch
0017-sched_ext-Implement-runnable-task-stall-watchdog.patch
0018-sched_ext-Allow-BPF-schedulers-to-disallow-specific-.patch
0019-sched_ext-Print-sched_ext-info-when-dumping-stack.patch
0020-sched_ext-Print-debug-dump-after-an-error-exit.patch
0021-tools-sched_ext-Add-scx_show_state.py.patch
0022-sched_ext-Implement-scx_bpf_kick_cpu-and-task-preemp.patch
0023-sched_ext-Add-a-central-scheduler-which-makes-all-sc.patch
0024-sched_ext-Make-watchdog-handle-ops.dispatch-looping-.patch
0025-sched_ext-Add-task-state-tracking-operations.patch
0026-sched_ext-Implement-tickless-support.patch
0027-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch
0028-sched_ext-Add-cgroup-support.patch
0029-sched_ext-Add-a-cgroup-scheduler-which-uses-flattene.patch
0030-sched_ext-Implement-SCX_KICK_WAIT.patch
0031-sched_ext-Implement-sched_ext_ops.cpu_acquire-releas.patch
0032-sched_ext-Implement-sched_ext_ops.cpu_online-offline.patch
0033-sched_ext-Bypass-BPF-scheduler-while-PM-events-are-i.patch
0034-sched_ext-Implement-core-sched-support.patch
0035-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch
0036-sched_ext-Implement-DSQ-iterator.patch
0037-sched_ext-Add-cpuperf-support.patch
0038-sched_ext-Documentation-scheduler-Document-extensibl.patch
0039-sched_ext-Add-selftests.patch

0001     : Cgroup prep.

0002-0012: Scheduler prep.

0013-0015: sched_ext core implementation and a couple of example BPF
           schedulers.

0016-0021: Utility features including safety mechanisms, switch-all and
           printing sched_ext state when dumping backtraces.

0022-0027: Kicking and preempting other CPUs, task state transition
           tracking and tickless support. Demonstrated with an example
           central scheduler which makes all scheduling decisions on one
           CPU.

0028-0030: cgroup support and the ability to wait for other CPUs after
           kicking them.

0031-0033: Add CPU preemption and hotplug and power-management support.

0034     : Add core-sched support.

0035-0036: Add DSQ rbtree and iterator support.

0037     : Add cpuperf (frequency scaling) support.

0038     : Add documentation.

0039     : Add selftests.

The patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git sched_ext-v6

diffstat follows.

 Documentation/scheduler/index.rst              |    1
 Documentation/scheduler/sched-ext.rst          |  307 +
 MAINTAINERS                                    |   13
 Makefile                                       |    8
 drivers/tty/sysrq.c                            |    1
 include/asm-generic/vmlinux.lds.h              |    1
 include/linux/cgroup-defs.h                    |    8
 include/linux/cgroup.h                         |    5
 include/linux/sched.h                          |    5
 include/linux/sched/ext.h                      |  210
 include/linux/sched/task.h                     |    3
 include/uapi/linux/sched.h                     |    1
 init/Kconfig                                   |    5
 init/init_task.c                               |   12
 kernel/Kconfig.preempt                         |   24
 kernel/cgroup/cgroup.c                         |   97
 kernel/fork.c                                  |   17
 kernel/sched/build_policy.c                    |    8
 kernel/sched/core.c                            |  324 +
 kernel/sched/cpufreq_schedutil.c               |   50
 kernel/sched/deadline.c                        |    4
 kernel/sched/debug.c                           |    3
 kernel/sched/ext.c                             | 6641 +++++++++++++++++++
 kernel/sched/ext.h                             |  139
 kernel/sched/fair.c                            |   25
 kernel/sched/idle.c                            |    2
 kernel/sched/rt.c                              |    4
 kernel/sched/sched.h                           |  123
 kernel/sched/topology.c                        |    4
 lib/dump_stack.c                               |    1
 tools/Makefile                                 |   10
 tools/sched_ext/.gitignore                     |    2
 tools/sched_ext/Makefile                       |  246 +
 tools/sched_ext/README.md                      |  270 +
 tools/sched_ext/include/bpf-compat/gnu/stubs.h |   11
 tools/sched_ext/include/scx/common.bpf.h       |  301 +
=C2=A0tools/sched_ext/include/scx/common.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 |=C2=A0 =C2=A071
=C2=A0tools/sched_ext/include/scx/compat.bpf.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2= =A0 110
=C2=A0tools/sched_ext/include/scx/compat.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 |=C2=A0 197
=C2=A0tools/sched_ext/include/scx/user_exit_info.h=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 111
=C2=A0tools/sched_ext/scx_central.bpf.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0|=C2=A0 361 +
=C2=A0tools/sched_ext/scx_central.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 135
=C2=A0tools/sched_ext/scx_flatcg.bpf.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 |=C2=A0 939 ++++
=C2=A0tools/sched_ext/scx_flatcg.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 233 +
=C2=A0tools/sched_ext/scx_flatcg.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A051
=C2=A0tools/sched_ext/scx_qmap.bpf.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 |=C2=A0 673 +++
=C2=A0tools/sched_ext/scx_qmap.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 150
=C2=A0tools/sched_ext/scx_show_state.py=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0|=C2=A0 =C2=A039
=C2=A0tools/sched_ext/scx_simple.bpf.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 |=C2=A0 149
=C2=A0tools/sched_ext/scx_simple.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 107
=C2=A0tools/testing/selftests/sched_ext/.gitignore=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 = 6
=C2=A0tools/testing/selftests/sched_ext/Makefile=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 216= +
=C2=A0tools/testing/selftests/sched_ext/config=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2= =A0 =C2=A0 9
=C2=A0tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A042
=C2=A0tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A057
=C2=A0tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c=C2=A0 = =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A039
=C2=A0tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A056
=C2=A0tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c=C2=A0 = =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A021
=C2=A0tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A060
=C2=A0tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c=C2=A0 = =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A043
=C2=A0tools/testing/selftests/sched_ext/enq_select_cpu_fails.c=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A061
=C2=A0tools/testing/selftests/sched_ext/exit.bpf.c=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A08= 4
=C2=A0tools/testing/selftests/sched_ext/exit.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2= =A0 =C2=A055
=C2=A0tools/testing/selftests/sched_ext/exit_test.h=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A020=
=C2=A0tools/testing/selftests/sched_ext/hotplug.bpf.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A055
=C2=A0tools/testing/selftests/sched_ext/hotplug.c=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 1= 68
=C2=A0tools/testing/selftests/sched_ext/hotplug_test.h=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A015
=C2=A0tools/testing/selftests/sched_ext/init_enable_count.bpf.c=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A053
=C2=A0tools/testing/selftests/sched_ext/init_enable_count.c=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 166
=C2=A0tools/testing/selftests/sched_ext/maximal.bpf.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 164
=C2=A0tools/testing/selftests/sched_ext/maximal.c=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 = =C2=A051
=C2=A0tools/testing/selftests/sched_ext/maybe_null.bpf.c=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A026
=C2=A0tools/testing/selftests/sched_ext/maybe_null.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A040 =C2=A0tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A025
=C2=A0tools/testing/selftests/sched_ext/minimal.bpf.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A021
=C2=A0tools/testing/selftests/sched_ext/minimal.c=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 = =C2=A058
=C2=A0tools/testing/selftests/sched_ext/prog_run.bpf.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A032
=C2=A0tools/testing/selftests/sched_ext/prog_run.c=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A07= 8
=C2=A0tools/testing/selftests/sched_ext/reload_loop.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A075
=C2=A0tools/testing/selftests/sched_ext/runner.c=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 201=
=C2=A0tools/testing/selftests/sched_ext/scx_test.h=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 131
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A040
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dfl.c=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A072
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c=C2= =A0 =C2=A0|=C2=A0 =C2=A089
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c=C2=A0 = =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A072
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A041
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch.c=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A070
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c |= =C2=A0 =C2=A037
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c=C2=A0= =C2=A0 =C2=A0|=C2=A0 =C2=A056
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c |= =C2=A0 =C2=A038
=C2=A0tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c=C2=A0= =C2=A0 =C2=A0|=C2=A0 =C2=A056
=C2=A0tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A092
=C2=A0tools/testing/selftests/sched_ext/select_cpu_vtime.c=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A059
=C2=A0tools/testing/selftests/sched_ext/test_example.c=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A049
=C2=A0tools/testing/selftests/sched_ext/util.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2= =A0 =C2=A071
=C2=A0tools/testing/selftests/sched_ext/util.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2= =A0 =C2=A013
=C2=A096 files changed, 15056 insertions(+), 139 deletions(-)


Patchset History
----------------

v4 (http://lkml.kernel.org/r/20230711011412.100319-1-tj@kernel.org) -> v5:

- Updated to rebase on top of the current bpf/for-next (2023-11-06).
  '0002-0010: Scheduler prep' were simply rebased on top of the new EEVDF
  scheduler, which demonstrates a clean-cut API boundary between sched-ext
  and sched core.

- To accommodate 32-bit configs, fields which use atomic ops and
  store_release/load_acquire are switched from 64 bits to longs.

- To help triaging, if sched_ext is enabled, backtrace dumps now show the
  currently active scheduler along with some debug information.

- Fixes for bugs including p->scx.flags corruption due to unsynchronized
  SCX_TASK_DSQ_ON_PRIQ changes, and overly permissive BTF struct and scx_bpf
  kfunc access checks.

- Other misc changes including renaming "type" to "kind" in scx_exit_info to
  ease usage from Rust and other languages in which "type" is a reserved
  keyword.

- scx_atropos is renamed to scx_rusty and received significant updates to
  improve scalability. Load metrics are now tracked in BPF and accessed only
  as necessary from userspace.

- Misc example scheduler improvements including the usage of a resizable BPF
  .bss array, the introduction of SCX_BUG[_ON](), and timer CPU pinning in
  scx_central.

- Improved Makefile and documentation for example schedulers.

v3 (https://lkml.kernel.org/r/20230317213333.2174969-1-tj@kernel.org) -> v4:

- There aren't any significant changes to the sched_ext API even though we
  kept experimenting heavily with a couple of BPF scheduler implementations,
  indicating that the core API has reached a level of maturity.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch, which
  implemented a custom guard scope for scheduler attribute changes, was
  dropped as upstream is moving towards a more generic implementation.

- Build fixes with different CONFIG combinations.

- Core code cleanups and improvements including how the idle CPU is selected
  and disabling ttwu_queue for tasks on SCX to avoid confusing BPF schedulers
  expecting a ->select_cpu() call. See
  0012-sched_ext-Implement-BPF-extensible-scheduler-class.patch for more
  details.

- "_example" dropped from the example schedulers as the distinction between
  the example-only and practically-useful isn't black-and-white. Instead,
  each scheduler has detailed comments and there's also a README file.

- scx_central, scx_pair and scx_flatcg are moved into their own patches as
  suggested by Josh Don.

- scx_atropos received substantial updates including fixes for bugs that
  could cause temporary stalls and improvements in load balancing and wakeup
  target CPU selection. For details, see
  0034-sched_ext-Add-a-rust-userspace-hybrid-example-schedu.patch.

v2 (http://lkml.kernel.org/r/20230128001639.3510083-1-tj@kernel.org) -> v3:

- ops.set_weight() added to allow BPF schedulers to track weight changes
  without polling p->scx.weight.

- scx_bpf_task_cgroup() kfunc added to allow BPF schedulers to reliably
  determine the current CPU cgroup under rq lock protection. This required
  improving the kf_mask SCX operation verification mechanism and adding
  0023-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch.

- Updated to use the latest BPF improvements including KF_RCU and the inline
  iterator.

- scx_example_flatcg added to 0024-sched_ext-Add-cgroup-support.patch. It
  uses the new BPF RB tree support to implement a flattened cgroup hierarchy.

- A DSQ now also contains an rbtree so that it can be used to implement
  vtime based scheduling among tasks sharing a DSQ conveniently and
  efficiently. For more details, see
  0029-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch. All
  eligible example schedulers are updated to default to weighted vtime
  scheduling.

- atropos scheduler's userspace code is substantially restructured and
  rewritten. The binary is renamed to scx_atropos and can auto-config the
  domains according to the cache topology.

- Various other example scheduler updates including scx_example_dummy being
  renamed to scx_example_simple, the example schedulers defaulting to
  enabling switch_all, and clarifying the performance expectations of each
  example scheduler.

- A bunch of fixes and improvements. Please refer to each patch for details.

v1 (http://lkml.kernel.org/r/20221130082313.3241517-1-tj@kernel.org) -> v2:

- Rebased on top of bpf/for-next - a5f6b9d577eb ("Merge branch 'Enable
  struct_ops programs to be sleepable'"). There were several missing
  features including generic cpumask helpers and sleepable struct_ops
  operation support that v1 was working around. The rebase gets rid of all
  SCX-specific temporary helpers.

- Some kfunc helpers are context-sensitive and can only be called from
  specific operations. v1 didn't restrict kfunc accesses, allowing them to
  be misused, which could lead to crashes and other malfunctions. v2 makes
  more kfuncs safe to be called from anywhere and implements per-task
  mask-based runtime access control for the rest. The longer-term plan is to
  make the BPF verifier enforce these restrictions. Combined with the above,
  sans mistakes and bugs, it shouldn't be possible to crash the machine
  through SCX and its helpers.

- Core-sched support. While v1 implemented the pick_task operation, there
  were multiple missing pieces for working core-sched support. v2 adds
  0027-sched_ext-Implement-core-sched-support.patch. SCX by default
  implements global FIFO ordering and allows the BPF schedulers to implement
  custom ordering via scx_ops.core_sched_before(). scx_example_qmap is
  updated so that the five queues' relative priorities are correctly
  reflected when core-sched is enabled.

- Dropped balance_scx_on_up() which was called from put_prev_task_balance().
  UP support is now contained in SCX proper.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch adds
  SCHED_CHANGE_BLOCK() which encapsulates the preparation and restoration
  sequences used for task attribute changes. For SCX, this replaces
  sched_deq_and_put_task() and sched_enq_and_set_task() from v1.

- 0011-sched-Add-reason-to-sched_move_task.patch dropped from v1. SCX now
  distinguishes cgroup and autogroup tg's using task_group_is_autogroup().

- Other misc changes including fixes for bugs that Julia Lawall noticed and
  patch description updates with more details on how the introduced changes
  are going to be used.

- MAINTAINERS entries added.

The following are discussion points which were raised but didn't result in
code changes in this iteration.

- There were discussions around exposing __setscheduler_prio() and, in v2,
  SCHED_CHANGE_BLOCK() in kernel/sched/sched.h. Switching scheduler
  implementations is innate for SCX. At the very least, it needs to be able
  to turn on and off the BPF scheduler, which requires something equivalent
  to SCHED_CHANGE_BLOCK(). The use of __setscheduler_prio() depends on the
  behavior we want to present to userspace. The current one of using CFS as
  the fallback when the BPF scheduler is not available seems friendlier and
  less error-prone than other options.

- Another discussion point was around for_each_active_class() and friends
  which skip over CFS or SCX when it's known that the sched_class must be
  empty. I left it as-is for now as it seems to be cleaner and more robust
  than trying to plug each operation, which may add unnecessary overheads.

Thanks.

--
tejun
