The Other CPU Cache
We’ve seen how using the CPU’s cache can lead to a 13x speedup, but that’s only utilizing one of the CPU’s cache types. Today’s article shows you how to go further by utilizing a whole other type of CPU caching!
Consumer CPUs since the 1980s have had two kinds of caches on the chip itself to speed up access to slow main RAM. The first is the data cache I wrote about two weeks ago. In a nutshell, whenever you access memory the CPU first checks its cache to see if what you need is there and, if it is, avoids accessing RAM. A cache “miss” means the CPU “stalls” (i.e. does nothing) while it waits a really long time for a whole “cache line” (e.g. 32 or 64 bytes) to be fetched from RAM. This is called the “data cache” and it caches the data your program stores in RAM.
The other cache is the “instruction cache”, and it stores the instructions that make up your program. As the CPU executes your program it pulls instructions from RAM. To speed this up, it first checks the instruction cache. This all behaves very similarly to the data cache.
Normally your instructions are just a linear list of commands with “jumps” (e.g. a C# `if` or `while`) that only go a little way. You might skip forward or backward a few dozen instructions, but you probably won’t skip a million. Even if you do, modern CPUs will read ahead of the current instruction to see that you’re about to jump to a far-flung function and try to load that code ahead of time so it’s already in the instruction cache by the time you need it. That means code is normally very well set up for localized execution that repeatedly reads from the same lines of the instruction cache.
So what can go wrong? The problems start to happen when we write code where the execution order is determined at runtime. The code the CPU ends up executing says “read this memory location and jump to the address you find there”. That means the instruction to jump to might change, so CPUs have a hard time pre-loading the right memory into the instruction cache.
In C# there are two main ways to trigger this. The first is the most literal version: delegates. Delegates (e.g. `Action`) are variables that can be changed at runtime to “point” to different functions. They can even point to multiple functions. So when the CPU gets to a delegate call it might not be able to predict which function it should load into the instruction cache and might need to stall while it waits for the right function to load from RAM.
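As a minimal sketch of this (the `Hello` and `Goodbye` functions are hypothetical), a delegate variable can be retargeted at runtime, so the call site can’t know ahead of time which function body it will jump to:

```csharp
using System;

class DelegateExample
{
    static void Hello() { Console.WriteLine("hello"); }
    static void Goodbye() { Console.WriteLine("goodbye"); }

    static void Main()
    {
        // The variable decides at runtime which code the CPU jumps to
        Action speak = Hello;
        speak();    // prints "hello"

        speak = Goodbye;
        speak();    // prints "goodbye"

        // Multicast: one invocation jumps to multiple function bodies
        speak += Hello;
        speak();    // prints "goodbye" then "hello"
    }
}
```

Each `speak()` line compiles to an indirect call through whatever the variable currently holds, which is exactly what the CPU has trouble pre-loading.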
The second way to trigger this is less obvious: virtual functions. These include functions marked `virtual`, functions marked `override` to override a virtual function, and any function implementing a function defined in an interface. Consider the classic polymorphism example where `IAnimal` has a `void Speak()` and `Cat` and `Dog` classes implement `IAnimal` by defining their own `Speak` functions that print “meow” and “woof”. Then some other code has an `IAnimal` variable and calls `animal.Speak()`. The way this works behind the scenes is that both `Cat` and `Dog` instances contain a function pointer (essentially a low-level delegate) that points to their own version of `Speak`. This too trips up the CPU, causing it to frequently wait on main memory just to know what to execute next.
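In code, that classic example looks something like this minimal sketch:

```csharp
using System;

interface IAnimal
{
    void Speak();
}

class Cat : IAnimal
{
    public void Speak() { Console.WriteLine("meow"); }
}

class Dog : IAnimal
{
    public void Speak() { Console.WriteLine("woof"); }
}

class Program
{
    static void Main()
    {
        // The call site is identical for both, so the CPU must follow a
        // function pointer at runtime to find out which Speak to execute
        IAnimal animal = new Cat();
        animal.Speak(); // prints "meow"

        animal = new Dog();
        animal.Speak(); // prints "woof"
    }
}
```

Both `animal.Speak()` lines look the same to the compiler; only the object’s hidden function pointer determines which instructions actually run.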
Unfortunately, it’s rather hard to create a small test script that demonstrates instruction cache misses. To do so requires something more like a real game, with a lot more complexity than I can put here in this article. For example, say you had an `IEntity` interface with 50 classes implementing it: `Obstacle`, etc. If you loop over an array of `IEntity` calling a virtual `Update` then you’ll constantly be telling the CPU to jump to different areas of RAM to find your instructions on how to update each of those types. If, however, you only had a few classes implementing `IEntity` then all of those `Update` functions might fit in the instruction cache. Hence the need for more code than is practical for this article.
By the way, a more instruction cache-friendly way to call `Update` on all the `IEntity` objects is to sort them by type and call non-virtual functions on each of them. This presumes that their update order doesn’t need to be interleaved by entity type, but that’s often the case.
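A sketch of that idea, assuming hypothetical `Cat` and `Dog` entity types with their own non-virtual update functions:

```csharp
using System.Collections.Generic;

class Cat { public void UpdateCat() { /* cat-specific update logic */ } }
class Dog { public void UpdateDog() { /* dog-specific update logic */ } }

class Game
{
    // One list per concrete type instead of one mixed List<IEntity>
    List<Cat> cats = new List<Cat>();
    List<Dog> dogs = new List<Dog>();

    public void UpdateAll()
    {
        // Each loop calls a non-virtual function, so the call target is
        // known at compile time and the same instructions stay hot in
        // the instruction cache for the whole loop
        foreach (Cat cat in cats) cat.UpdateCat();
        foreach (Dog dog in dogs) dog.UpdateDog();
    }
}
```

The trade-off is exactly the one mentioned above: all cats update before any dogs, so this only works when strict interleaving doesn’t matter.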
So keep the instruction cache in mind as you focus on performance. If you can arrange your code to avoid virtual functions and delegates then you’ll have a nice performance win on your hands.
#1 by Adam on May 22nd, 2017 ·
Hmm… this one’s a bit tougher to wrap my head around than previous articles. I’m wondering at what point this technique would become noticeable.
For example, in my current project (which is in dire need of optimization), I can sometimes have upwards of 200 objects in a scene using functions defined in an interface. That interface is implemented by ~15 different classes. In this case, do you think removing the dependence on virtual functions would yield any significant time savings?
#2 by jackson on May 22nd, 2017 ·
It really depends on the size of your functions (i.e. how much needs to be loaded into the instruction cache), the CPU you’re running on (i.e. the instruction cache architecture, size, and settings), and what you consider significant time savings. You may have instruction cache misses with just 15 functions. You probably have a lot of mis-predictions since it’s hard for the CPU to tell which instructions it should execute next. You’re almost certainly incurring overhead for indirect function calls.
Say your classes are A, B, C, and D. You have 15, so keep going with the letters. Now say you have an array of class instances like this: B, A, C, A, A, D, B, D, D, D, A, etc. If your logic allows for it, rearrange into one array per type so you have one array with A, A, A, A, one array with B, B, B, B, etc. Now loop over the A array doing work on them, then the B array, etc. Since you know the type of the elements in each array you don’t need virtual function calls. There’s much less chance of instruction cache misses, you’ll have fewer mis-predictions, and you won’t have any indirect function call overhead.
You should also consider converting these classes into structs so that you can tightly pack them in the arrays. It’ll probably be a lot easier to do this once you’ve removed the virtual functions. You could use threads too, but at only 200 items you probably won’t need to. As always, profile frequently so you know what problem you have and how effective your changes are. I don’t know your app so I could be way off. :)
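For example, a hypothetical `Enemy` converted to a struct packs its fields contiguously in the array, which helps the data cache on top of the instruction cache win:

```csharp
// A struct stored in an array is laid out inline, element after element,
// so looping over it reads contiguous memory (data cache-friendly) while
// the single non-virtual Update keeps the instruction cache hot
struct Enemy
{
    public float X;
    public float Y;
    public float Health;

    public void Update(float dt)
    {
        X += dt; // placeholder movement logic
    }
}

class Game
{
    Enemy[] enemies = new Enemy[200];

    public void UpdateAll(float dt)
    {
        for (int i = 0; i < enemies.Length; i++)
        {
            // Array element access mutates the struct in place
            enemies[i].Update(dt);
        }
    }
}
```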
#3 by Kaiyum on May 23rd, 2017 ·
In a VM environment like .NET, how sure are we about those delegates, interfaces, etc.? A concrete test result would awesomely confirm this article.
#4 by jackson on May 23rd, 2017 ·
It’s easy to check out the code with IL2CPP. For example, if you build for iOS and open up the project in Xcode you can search for “YourClass::YourFunction” and find the C++ version of your C#. Oftentimes you’ll see a function call like `VirtActionInvoker1<float>::Invoke` for your virtual function call. It’s ugly code, but you can generally figure out what’s going on.
#5 by Chris on June 10th, 2017 ·
Does this still apply when hiding / overriding functions using the `new` keyword?
#6 by jackson on June 11th, 2017 ·
`new` functions aren’t virtual. For example:
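A minimal sketch of the behavior (the `Animal` base class here is hypothetical):

```csharp
using System;

class Animal
{
    public void Speak() { Console.WriteLine("generic animal sound"); }
}

class Dog : Animal
{
    // 'new' hides Animal.Speak instead of overriding it; there is
    // no virtual dispatch involved
    public new void Speak() { Console.WriteLine("woof"); }
}

class Program
{
    static void Main()
    {
        Dog dog = new Dog();
        dog.Speak();    // prints "woof"

        Animal animal = dog;
        animal.Speak(); // prints "generic animal sound"
    }
}
```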
The last print is the base class’ version because the actual type, `Dog`, isn’t taken into account at runtime due to the lack of a virtual function. The compiler simply resolves the call from the variable’s declared type, so `Dog`’s `Speak` is never called. If you used `virtual` and `override` instead then you’d see virtual calls in the generated IL2CPP output. The important lines there are the ones with calls to `VirtActionInvoker0::Invoke`, which calls a virtual function.