Being fast is great, but feeling fast is even better
Howard Oakley did an excellent deep dive on M1 scheduling and performance.
Jim Salter
Apple’s M1 processor is a world-class desktop and laptop processor—but when it comes to general-purpose end-user systems, there’s something even better than being fast. We’re referring, of course, to feeling fast—which has more to do with a system meeting user expectations predictably and reliably than it does with raw speed.
Howard Oakley—author of several Mac-native utilities such as Cormorant, Spundle, and Stibium—did some digging to find out why his M1 Mac felt faster than Intel Macs did, and he concluded that the answer is QoS. If you’re not familiar with the term, it’s short for Quality of Service—and it’s all about task scheduling.
More throughput doesn’t always mean happier users
There’s a very common tendency to equate “performance” with throughput—roughly speaking, tasks accomplished per unit of time. Although throughput is generally the easiest metric to measure, it doesn’t correspond very well to human perception. What humans generally notice isn’t throughput, it’s latency—not the number of times a task can be accomplished, but the time it takes to complete an individual task.
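A short worked example makes the distinction concrete. The sketch below uses a made-up workload (one 10-second batch job and three 1-second interactive tasks on a single worker, all numbers hypothetical) and runs it in two different orders. Throughput is identical either way, but the average time a task waits to complete is not.

```python
from itertools import accumulate


def completion_times(durations):
    """Finish time of each task when tasks run back to back on one worker."""
    return list(accumulate(durations))


# Hypothetical workload: one 10-second batch job, three 1-second interactive tasks.
batch_first = [10, 1, 1, 1]
interactive_first = [1, 1, 1, 10]

for name, order in (("batch first", batch_first),
                    ("interactive first", interactive_first)):
    finished = completion_times(order)
    throughput = len(order) / finished[-1]        # tasks per second
    mean_latency = sum(finished) / len(finished)  # average time until a task is done
    print(f"{name}: {throughput:.3f} tasks/s, mean latency {mean_latency:.2f} s")
```

Both orders finish four tasks in 13 seconds, yet the mean completion time drops from 11.5 seconds to 4.75 seconds when the short tasks go first.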
Here at Ars, our own Wi-Fi testing metrics follow this concept—we measure the amount of time it takes to load an emulated webpage under reasonably normal network conditions rather than measuring the number of times a webpage (or anything else) can be loaded per second while running flat out.
We can also see a negative example, one in which the fastest throughput corresponded to distinctly unhappy users, with the circa-2006 introduction of the Completely Fair Queuing (cfq) I/O scheduler in the Linux kernel. cfq can be tuned extensively, but in its out-of-box configuration, it maximizes throughput by reordering disk reads and writes to minimize seeking, then offering round-robin service to all active processes.
Unfortunately, while cfq did in fact measurably improve maximum throughput, it did so at the expense of task latency. A moderately loaded system felt sluggish and unresponsive to its users, which led to a large groundswell of complaints.
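To see why reordering for fewer seeks can hurt an individual task, consider the toy model below. It is not cfq's actual algorithm, only a simplified sort-by-sector sketch with invented sector numbers and owner names. Total head travel falls sharply, but the request that arrived first is now served last.

```python
def serve(requests, start=0):
    """Serve (owner, sector) requests in the given order.

    Returns total head travel and the position in line at which each owner is served.
    """
    head, travel, served_at = start, 0, {}
    for position, (owner, sector) in enumerate(requests, 1):
        travel += abs(sector - head)  # distance the head moves for this request
        head = sector
        served_at[owner] = position
    return travel, served_at


# Hypothetical queue in arrival order: the interactive request arrives first.
arrival = [("interactive", 900), ("batch-a", 100), ("batch-b", 120), ("batch-c", 140)]
by_sector = sorted(arrival, key=lambda request: request[1])

fifo_travel, fifo_position = serve(arrival)
sorted_travel, sorted_position = serve(by_sector)

print("arrival order:", fifo_travel, "sectors travelled,",
      "interactive served at position", fifo_position["interactive"])
print("sorted order: ", sorted_travel, "sectors travelled,",
      "interactive served at position", sorted_position["interactive"])
```

In this contrived queue, sorting cuts head travel from 1,740 sectors to 900, while the interactive request moves from first in line to fourth.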
Although cfq could be tuned for lower latency, most unhappy users just replaced it entirely with a competing scheduler like noop or deadline instead. Despite the lower maximum throughput, the decreased individual latency made desktop/interactive users happier with how fast their machines felt.
After discovering how suboptimal it was to maximize throughput at the expense of latency, most Linux distributions moved away from cfq just as many of their users had. Red Hat ditched cfq for deadline in 2013, as did RHEL 7, and Ubuntu followed suit shortly thereafter in its 2014 Trusty Tahr (14.04) release. As of 2019, Ubuntu has deprecated cfq entirely.
QoS with Big Sur and the Apple M1
When Oakley noticed how frequently Mac users praised M1 Macs for feeling incredibly fast—despite performance measurements that don’t always back those feelings up—he took a closer look at macOS native task scheduling.
MacOS offers four directly specified levels of task prioritization. From low to high, they are background, utility, userInitiated, and userInteractive. There’s also a fifth level (the default, when no QoS level is manually specified), which allows macOS to decide for itself how important a task is.
These five QoS levels are the same whether your Mac is Intel-powered or Apple Silicon-powered, but how the QoS is imposed changes. On an eight-core Intel Xeon W CPU, if the system is idle, macOS will schedule any task across all eight cores, regardless of QoS settings. But on an M1, even if the system is entirely idle, background priority tasks run exclusively on the M1’s four efficiency/low-power Icestorm cores, leaving the four higher-performance Firestorm cores idle.
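The placement rule described above can be written down as a small lookup. This is a toy model in Python, not macOS code. The core labels and the eligible_cores function are invented for illustration, and the behavior of the non-background levels on the M1 is an assumption, since only the background case is spelled out here.

```python
from enum import Enum


class QoS(Enum):
    BACKGROUND = "background"
    UTILITY = "utility"
    USER_INITIATED = "userInitiated"
    USER_INTERACTIVE = "userInteractive"


# Hypothetical core labels, not real macOS identifiers.
ICESTORM = ["E0", "E1", "E2", "E3"]    # M1 efficiency cores
FIRESTORM = ["P0", "P1", "P2", "P3"]   # M1 performance cores
XEON_W = [f"C{i}" for i in range(8)]   # eight identical Intel cores


def eligible_cores(qos, machine):
    """Cores an otherwise idle system may use for a task at the given QoS level."""
    if machine == "intel":
        # All eight cores, regardless of QoS.
        return list(XEON_W)
    if machine == "m1":
        if qos is QoS.BACKGROUND:
            # Background work stays on the efficiency cores even when idle.
            return list(ICESTORM)
        # Assumption: every other level may use any core.
        return ICESTORM + FIRESTORM
    raise ValueError(f"unknown machine: {machine!r}")


for machine in ("intel", "m1"):
    for qos in QoS:
        print(machine, qos.value, eligible_cores(qos, machine))
```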
Although this made the lower-priority tasks Oakley tested the system with—compression of a 10GB test file—slower on the M1 Mac than the Intel Mac, the operations were more consistent across the spectrum of “idle system” to “very busy system.”
Operations with higher QoS settings also performed more consistently on the M1 than on the Intel Mac. Because macOS confines lower-priority tasks to the Icestorm cores, the higher-performance Firestorm cores stay unloaded and ready to respond both rapidly and consistently when userInitiated and userInteractive tasks need handling.
Conclusions
Apple’s QoS strategy for the M1 Mac is an excellent example of engineering for the actual pain point in a workload rather than chasing arbitrary metrics. Leaving the high-performance Firestorm cores idle when executing background tasks means that they can devote their full performance to the userInitiated and userInteractive tasks as they come in, avoiding the perception that the system is unresponsive or even “ignoring” the user.
It’s worth noting that Big Sur certainly could employ the same strategy with an eight-core Intel processor. Although there is no similar big/little split in core performance on x86, nothing is stopping an OS from arbitrarily declaring a certain number of cores to be background only. What makes the Apple M1 feel so fast isn’t the fact that four of its cores are slower than the others; it’s the operating system’s willingness to sacrifice maximum throughput in favor of lower task latency.
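A back-of-the-envelope model shows the trade. The sketch below assumes eight identical cores and plain equal time sharing among runnable threads, which is a deliberate simplification and not how any real scheduler behaves. The function names and the reserved parameter are hypothetical. With no cores held back, a one-second interactive task slows down as background load rises. With four cores off-limits to background work, it always takes one second, at the cost of half the background capacity.

```python
def interactive_latency(work_s, background_jobs, cores=8, reserved=0):
    """Seconds to finish one interactive task under equal time sharing (toy model).

    `reserved` cores never run background jobs (a hypothetical policy).
    """
    if reserved > 0:
        # A reserved core is always free for the single interactive task.
        return work_s
    runnable = background_jobs + 1  # background jobs plus the interactive task
    # Each thread gets a full core until threads outnumber cores.
    return work_s * max(1.0, runnable / cores)


def background_capacity(cores=8, reserved=0):
    """Cores available to background work."""
    return cores - reserved


for load in (0, 8, 16, 32):
    shared = interactive_latency(1.0, load)
    split = interactive_latency(1.0, load, reserved=4)
    print(f"{load:2d} background jobs: shared {shared:.3f} s, reserved {split:.3f} s")

print("background cores:", background_capacity(), "shared vs",
      background_capacity(reserved=4), "reserved")
```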
It’s also worth noting that the interactivity improvements M1 Mac users are seeing rely heavily on tasks being scheduled properly in the first place. If developers aren’t willing to use the low-priority background queue when appropriate because they don’t want their app to seem slow, everyone loses. Apple’s unusually vertical software stack likely helps significantly here, since Apple developers are more likely to prioritize overall system responsiveness even if it might potentially make their code “look bad” if very closely examined.
If you’re interested in more of the gritty details of how QoS levels are applied on M1 and Intel Macs—and the impact they make—we strongly recommend checking out Oakley’s original work here and here, complete with CPU History screenshots on the macOS Activity Monitor as Oakley runs tasks at various priorities on the two different architectures.