Oracle Linux kernel developer Steve Sistare contributes this dialogue on kernel scheduler enhancements.
Load balancing by way of scalable job stealing
The Linux job scheduler balances load throughout a system by pushing waking duties to idle CPUs, and by pulling duties from busy CPUs when a CPU turns into idle. Environment friendly scaling is a problem on each the push and pull sides on massive techniques. For pulls, the scheduler searches all CPUs in successively bigger domains till an overloaded CPU is discovered, and pulls a job from the busiest group. That is very costly, costing 10’s to 100’s of microseconds on massive techniques, so search time is restricted by the common idle time, and a few domains are usually not searched. Stability is just not at all times achieved, and idle CPUs go unused.
I’ve carried out an alternate mechanism that’s invoked after the present search in idle_balance() limits itself and finds nothing. I keep a bitmap of overloaded CPUs, the place a CPU units its bit when its runnable CFS job rely exceeds 1. The bitmap is sparse, with a restricted variety of vital bits per cacheline. This reduces cache competition when many threads concurrently set, clear, and go to parts. There’s a bitmap per last-level cache. When a CPU turns into idle, it searches the bitmap to seek out the primary overloaded CPU with a migratable job, and steals it. This easy stealing yields the next CPU utilization than idle_balance() alone, as a result of the search is affordable, costing 1 to 2 microseconds, so it might be referred to as each time the CPU is about to go idle. Stealing doesn’t offload the globally busiest queue, however it’s a lot better than working nothing in any respect.
Stealing improves utilization with solely a modest CPU overhead in scheduler code. Within the following experiment, hackbench is run with various numbers of teams (40 duties per group), and the delta in /proc/schedstat is proven for every run, averaged per CPU, augmented with these non-standard stats:
%discover – p.c of time spent in outdated and new capabilities that seek for idle CPUs and duties to steal and set the overloaded CPUs bitmap.
steal – variety of instances a job is stolen from one other CPU. Elapsed time improves by eight to 36%, costing at most zero.four% extra discover time.
CPU busy utilization is near 100% for the brand new kernel, as proven by the inexperienced curve within the following graph, versus the orange curve for the baseline kernel:
Stealing improves Oracle database OLTP efficiency by as much as 9% relying on load, and we have now seen some good enhancements for mysql, pgsql, gcc, java, and networking. Usually, stealing is most useful for workloads with a excessive context swap fee.
As of this writing, this work is just not but upstream, however the newest patch collection is at https://lkml.org/lkml/2018/12/6/1253. In case your kernel is constructed with CONFIG_SCHED_DEBUG=y, you’ll be able to confirm that it accommodates the stealing optimization utilizing
# grep -q STEAL /sys/kernel/debug/sched_features && echo Sure
If you happen to attempt it, observe that stealing is disabled for techniques with greater than 2 NUMA nodes, as a result of hackbench regresses on such techniques, as I clarify in https://lkml.org/lkml/2018/12/6/1250 .Nevertheless, I believe this impact is particular to hackbench and that stealing will assist different workloads on many-node techniques. To attempt it, reboot with kernel parameter sched_steal_node_limit = eight (or bigger).
After the essential stealing algorithm is pushed upstream, I’m contemplating the next enhancements:
If stealing inside the last-level cache doesn’t discover a candidate, steal throughout LLC’s and NUMA nodes.
Preserve a sparse bitmap to establish stealing candidates within the RT scheduling class. At the moment pull_rt_task() searches all run queues.
Take away the core and socket ranges from idle_balance(), as stealing handles these ranges. Take away idle_balance() totally when stealing throughout LLC is supported.
Preserve a bitmap to establish idle cores and idle CPUs, for push balancing.
This text initially appeared at Oracle Builders Weblog.