Apple’s M1 is a fast CPU—but M1 Macs feel even faster due to QoS

being fast is great; feeling fast is even better

Howard Oakley did an excellent deep dive on M1 scheduling and performance.

Jim Salter

The Apple M1 is a world-class processor, but it feels even faster than its already-great specs imply. Howard Oakley did a deep-dive investigation to find out why.

Apple’s M1 processor is a world-class desktop and laptop processor—but when it comes to general-purpose end-user systems, there’s something even better than being fast. We’re referring, of course, to feeling fast—which has more to do with a system meeting user expectations predictably and reliably than it does with raw speed.

Howard Oakley—author of several Mac-native utilities such as Cormorant, Spundle, and Stibium—did some digging to find out why his M1 Mac felt faster than Intel Macs did, and he concluded that the answer is QoS. If you’re not familiar with the term, it’s short for Quality of Service—and it’s all about task scheduling.

More throughput doesn’t always mean happier users

There’s a very common tendency to equate “performance” with throughput—roughly speaking, tasks accomplished per unit of time. Although throughput is generally the easiest metric to measure, it doesn’t correspond very well to human perception. What humans generally notice isn’t throughput, it’s latency—not the number of times a task can be accomplished, but the time it takes to complete an individual task.
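To make the distinction concrete, here is a minimal Python sketch using made-up job lengths (the numbers are illustrative, not measurements). It assumes a single core and that all jobs are submitted at time zero, so a job's latency is simply the time at which it finishes. Both orderings complete the same three jobs in the same total time, so throughput is identical, yet the average wait for a result differs considerably.

```python
def completion_times(durations):
    """Return the finish time of each job when run back to back on one core."""
    finished = []
    clock = 0.0
    for duration in durations:
        clock += duration
        finished.append(clock)
    return finished


def summarize(durations):
    """Return (throughput in jobs per second, mean completion time in seconds)."""
    done = completion_times(durations)
    throughput = len(durations) / done[-1]
    mean_latency = sum(done) / len(done)
    return throughput, mean_latency


# Hypothetical workload: one long batch job (10 s) and two quick jobs (1 s each).
long_first = [10.0, 1.0, 1.0]
short_first = [1.0, 1.0, 10.0]

for name, order in (("long job first", long_first), ("short jobs first", short_first)):
    throughput, mean_latency = summarize(order)
    print(f"{name}: {throughput:.2f} jobs/s, mean completion {mean_latency:.1f} s")
```

A benchmark that only counts jobs per second cannot tell these two schedules apart, while a user waiting on one of the quick jobs certainly can.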

Here at Ars, our own Wi-Fi testing metrics follow this concept—we measure the amount of time it takes to load an emulated webpage under reasonably normal network conditions rather than measuring the number of times a webpage (or anything else) can be loaded per second while running flat out.

We can also see a negative example, one in which the highest throughput corresponded to distinctly unhappy users, in the circa-2006 introduction of the Completely Fair Queuing (cfq) I/O scheduler in the Linux kernel. cfq can be tuned extensively, but in its out-of-box configuration, it maximizes throughput by reordering disk reads and writes to minimize seeking, then offering round-robin service to all active processes.

Unfortunately, while cfq did in fact measurably improve maximum throughput, it did so at the cost of increased task latency, which meant that a moderately loaded system felt sluggish and unresponsive to its users, leading to a large groundswell of complaints.
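The trade-off can be illustrated with a toy model. The Python sketch below is not a model of cfq itself; it only assumes a single disk head, a seek cost equal to the distance moved, and a fixed transfer cost per request, all in arbitrary time units. The request names, track positions, and the `service` helper are hypothetical.

```python
def service(requests, head=0, transfer_time=50):
    """Serve requests in the given order.

    Returns ({request name: finish time}, total time for the whole batch).
    """
    clock = 0
    finish = {}
    for name, track in requests:
        clock += abs(track - head) + transfer_time  # seek cost plus fixed transfer cost
        head = track
        finish[name] = clock
    return finish, clock


# Hypothetical queue: request A arrived first but sits far from the other three.
arrival_order = [("A", 100), ("B", 10), ("C", 20), ("D", 30)]

fifo_finish, fifo_total = service(arrival_order)
sorted_finish, sorted_total = service(sorted(arrival_order, key=lambda r: r[1]))

print("arrival order:", fifo_total, "units total, A done at", fifo_finish["A"])
print("seek-sorted:  ", sorted_total, "units total, A done at", sorted_finish["A"])
```

In this example, sorting by track finishes the whole batch sooner (300 units instead of 410), but request A, which arrived first, completes at 300 units instead of 150. The batch got faster while one waiting process got slower.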

Although cfq could be tuned for lower latency, most unhappy users simply replaced it with a competing scheduler like noop or deadline. Despite the lower maximum throughput, the reduced per-task latency made desktop and interactive users happier with how fast their machines felt.

After discovering how poorly throughput maximized at the expense of latency served their users, most Linux distributions moved away from cfq just as many of those users had. Red Hat ditched cfq for deadline in 2013, as did RHEL 7, and Ubuntu followed suit shortly thereafter in its 2014 Trusty Tahr (14.04) release. As of 2019, Ubuntu has deprecated cfq entirely.

QoS with Big Sur and the Apple M1

When Oakley noticed how frequently Mac users praised M1 Macs for feeling incredibly fast—despite performance measurements that don’t always back those feelings up—he took a closer look at macOS native task scheduling.

macOS offers four directly specified levels of task prioritization. From low to high, they are background, utility, userInitiated, and userInteractive. There’s also a fifth level, the default used when no QoS level is manually specified, which allows macOS to decide for itself how important a task is.

These five QoS levels are the same whether your Mac is Intel-powered or Apple Silicon-powered—but how the QoS is imposed changes. On an eight-core Intel Xeon W CPU, if the system is idle, macOS will schedule any task across all eight cores, regardless of QoS settings. But on an M1, even if the system is entirely idle, background priority tasks run exclusively on the M1’s four efficiency/low-power Icestorm cores, leaving the four higher-performance Firestorm cores idle.
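The placement behavior described above can be summarized as a small lookup. This Python sketch is purely illustrative and is not how macOS is implemented: the core labels, platform strings, and the `eligible_cores` function are hypothetical. The claim that non-background tasks on the M1 may use any core is also an assumption, since the text above only states what happens to background tasks.

```python
from enum import Enum


class QoS(Enum):
    BACKGROUND = "background"
    UTILITY = "utility"
    DEFAULT = "default"
    USER_INITIATED = "userInitiated"
    USER_INTERACTIVE = "userInteractive"


# Hypothetical core labels, used only for illustration.
M1_EFFICIENCY = ["E0", "E1", "E2", "E3"]   # Icestorm
M1_PERFORMANCE = ["P0", "P1", "P2", "P3"]  # Firestorm
XEON_CORES = [f"C{i}" for i in range(8)]   # eight identical cores


def eligible_cores(platform, qos):
    """Return the cores a task with the given QoS may be scheduled on."""
    if platform == "intel":
        # On an idle Intel Mac, any task may run on any core regardless of QoS.
        return list(XEON_CORES)
    if platform == "m1":
        if qos is QoS.BACKGROUND:
            # Background work stays on the efficiency cores, even when idle.
            return list(M1_EFFICIENCY)
        # Assumption: every other QoS level may use all eight cores.
        return M1_PERFORMANCE + M1_EFFICIENCY
    raise ValueError(f"unknown platform: {platform}")


for platform in ("intel", "m1"):
    for qos in (QoS.BACKGROUND, QoS.USER_INTERACTIVE):
        print(platform, qos.value, eligible_cores(platform, qos))
```

The only asymmetry is in the background row: on the M1, that work never touches the performance cores, which is what keeps them free for anything the user is waiting on.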

Although this made the lower-priority task Oakley tested the system with (compression of a 10GB test file) slower on the M1 Mac than on the Intel Mac, its run times were more consistent across the spectrum from “idle system” to “very busy system.”

Operations with higher QoS settings also performed more consistently on the M1 than on the Intel Mac. Because macOS dumps lower-priority tasks onto the Icestorm cores, the higher-performance Firestorm cores stay unloaded and ready to respond both rapidly and consistently when userInitiated and userInteractive tasks need handling.

Conclusions

Apple’s QoS strategy for the M1 Mac is an excellent example of engineering for the actual pain point in a workload rather than chasing arbitrary metrics. Leaving the high-performance Firestorm cores idle when executing background tasks means that they can devote their full performance to the userInitiated and userInteractive tasks as they come in, avoiding the perception that the system is unresponsive or even “ignoring” the user.

It’s worth noting that Big Sur certainly could employ the same strategy with an eight-core Intel processor. Although there is no similar big/little split in core performance on x86, nothing is stopping an OS from arbitrarily declaring a certain number of cores to be background only. What makes the Apple M1 feel so fast isn’t the fact that four of its cores are slower than the others—it’s the operating system’s willingness to sacrifice maximum throughput in favor of lower task latency.

It’s also worth noting that the interactivity improvements M1 Mac users are seeing rely heavily on tasks being scheduled properly in the first place—if developers aren’t willing to use the low-priority background queue when appropriate because they don’t want their app to seem slow, everyone loses. Apple’s unusually vertical software stack likely helps significantly here, since Apple developers are more likely to prioritize overall system responsiveness even if it might potentially make their code “look bad” if very closely examined.

If you’re interested in more of the gritty details of how QoS levels are applied on M1 and Intel Macs—and the impact they make—we strongly recommend checking out Oakley’s original work here and here, complete with CPU History screenshots on the macOS Activity Monitor as Oakley runs tasks at various priorities on the two different architectures.
