Structs are great for controlling memory layout and avoiding the GC, but we can go a lot further to get even more speed out of them. Today we’ll look at a simple tweak that can dramatically speed up the code using the structs without even changing it!

Let’s start by designing a struct for a player in a game. First, we’ll use the object-oriented approach and put everything about the player in one big struct:

struct PlayerExtras
{
    public Vector3 Position;      // 3 * 4 = 12
    public Vector3 Velocity;      // 3 * 4 = 12
    public int Health;            //          4
    public int MaxHealth;         //          4
    public int NumLives;          //          4
    public int Score;             //          4
    public int TeamId;            //          4
    public int LeftHandWeaponId;  //          4
    public int RightHandWeaponId; //          4
    public int NumWins;           //          4
    public int NumLosses;         //          4
    public int MatchmakingRank;   //          4
                                  // Total:  64
}

The comments to the right of the fields indicate how many bytes each field takes up. This is 4 for all the int fields but 12 for Vector3 because it’s internally made up of 3 float fields which each take up 4 bytes.

Now let’s split off just the first four fields into their own struct. We’ll call this just the “basics” and define it like this:

struct PlayerBasics
{
    public Vector3 Position; // 3 * 4 = 12
    public Vector3 Velocity; // 3 * 4 = 12
    public int Health;       //          4
    public int MaxHealth;    //          4
                             // Total:  32
}
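
If we'd rather double-check those byte counts than trust the comments, we can ask the runtime. Here's a minimal sketch, assuming a stand-in Vec3 struct (three floats, like Vector3) so that it compiles outside of Unity. Vec3 and the StructSizes helper are names made up for this example. It relies on Marshal.SizeOf, which reports the marshaled size; for structs that contain only floats and ints, that should match the size in memory.

```csharp
using System.Runtime.InteropServices;

// Hypothetical stand-in for UnityEngine.Vector3: three 4-byte floats.
struct Vec3
{
    public float X;
    public float Y;
    public float Z;
}

struct PlayerBasics
{
    public Vec3 Position;
    public Vec3 Velocity;
    public int Health;
    public int MaxHealth;
}

struct PlayerExtras
{
    public Vec3 Position;
    public Vec3 Velocity;
    public int Health;
    public int MaxHealth;
    public int NumLives;
    public int Score;
    public int TeamId;
    public int LeftHandWeaponId;
    public int RightHandWeaponId;
    public int NumWins;
    public int NumLosses;
    public int MatchmakingRank;
}

// Hypothetical helper, not part of Unity or .NET.
static class StructSizes
{
    // Returns the marshaled size of T in bytes.
    public static int Of<T>() where T : struct
    {
        return Marshal.SizeOf<T>();
    }
}
```

With these definitions, StructSizes.Of<PlayerBasics>() should return 32 and StructSizes.Of<PlayerExtras>() should return 64, matching the totals in the comments.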

Now let’s consider some code that updates the position of the players in the game according to their velocity and some elapsed time: player.Position += player.Velocity * time. This is what we might do if the players were space ships, for example. We’ll simply loop over an array of each struct type, perform this calculation on every element, and measure the elapsed time with Stopwatch.

using System.Diagnostics;
using UnityEngine;
using Unity.IL2CPP.CompilerServices;
 
struct PlayerBasics
{
    public Vector3 Position; // 3 * 4 = 12
    public Vector3 Velocity; // 3 * 4 = 12
    public int Health;       //          4
    public int MaxHealth;    //          4
                             // Total:  32
}
 
struct PlayerExtras
{
    public Vector3 Position;      // 3 * 4 = 12
    public Vector3 Velocity;      // 3 * 4 = 12
    public int Health;            //          4
    public int MaxHealth;         //          4
    public int NumLives;          //          4
    public int Score;             //          4
    public int TeamId;            //          4
    public int LeftHandWeaponId;  //          4
    public int RightHandWeaponId; //          4
    public int NumWins;           //          4
    public int NumLosses;         //          4
    public int MatchmakingRank;   //          4
                                  // Total:  64
}
 
class TestScript : MonoBehaviour
{
    [Il2CppSetOption(Option.NullChecks, false)]
    [Il2CppSetOption(Option.ArrayBoundsChecks, false)]
    void Start()
    {
        const int count = 10000000;
        PlayerBasics[] basics = new PlayerBasics[count];
        PlayerExtras[] extras = new PlayerExtras[count];
        float time = Random.Range(0f, 1f); // float overload; the int overload Range(0, 1) always returns 0
 
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < count; ++i)
        {
            ref PlayerBasics player = ref basics[i];
            player.Position += player.Velocity * time;
        }
        long basicsTime = sw.ElapsedMilliseconds;
 
        sw.Reset();
        sw.Start();
        for (int i = 0; i < count; ++i)
        {
            ref PlayerExtras player = ref extras[i];
            player.Position += player.Velocity * time;
        }
        long extrasTime = sw.ElapsedMilliseconds;
 
        print(
            "Type,Time\n" +
            "Basics," + basicsTime + "\n" +
            "Extras," + extrasTime);
    }
}

Now let’s run this code to see how each loop performed. I ran using this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.14.3
  • Unity 2018.3.1f1
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

Here are the results I got:

Type    Time (ms)
Basics  41
Extras  63

[Figure: Loop Speeds]

The “extras” version of the player struct makes the loop take about 54% longer (63 ms versus 41 ms)! It’s the exact same code running on the exact same number of structs, so why is it so much slower?

The answer lies in the size of the structs being iterated over. In order to perform the calculation, the code needs to read the Position and Velocity fields from memory. When it reads Position, the CPU doesn’t read just the x, y, and z floats but an entire “cache line” of 64 bytes. In the case of PlayerExtras, this means the CPU reads all the other fields of the struct that will never be used.

To illustrate, let’s assume the first PlayerExtras element of the extras array is in memory at address 100. Here’s how that memory would look:

Address  Index  Variable
100      0      Position.x
104      0      Position.y
108      0      Position.z
112      0      Velocity.x
116      0      Velocity.y
120      0      Velocity.z
124      0      Health
128      0      MaxHealth
132      0      NumLives
136      0      Score
140      0      TeamId
144      0      LeftHandWeaponId
148      0      RightHandWeaponId
152      0      NumWins
156      0      NumLosses
160      0      MatchmakingRank

This one struct fills the entire 64-byte cache line, so the next element of the array isn’t in cache when the loop reaches it and a new cache line has to be fetched from RAM.

Now let’s look at PlayerBasics. Since it’s half the size, we see that two of them are fetched when the first one’s Position is read:

Address  Index  Variable
100      0      Position.x
104      0      Position.y
108      0      Position.z
112      0      Velocity.x
116      0      Velocity.y
120      0      Velocity.z
124      0      Health
128      0      MaxHealth
132      1      Position.x
136      1      Position.y
140      1      Position.z
144      1      Velocity.x
148      1      Velocity.y
152      1      Velocity.z
156      1      Health
160      1      MaxHealth

This means that the second element of the array can use the struct that’s already in cache rather than reaching out to RAM to fetch it.
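
The arithmetic behind the two tables is simple enough to write down. Here's a rough sketch, assuming 64-byte cache lines and an array that starts on a cache line boundary. CacheMath and its methods are hypothetical names for this example only.

```csharp
// Hypothetical helper for reasoning about cache line usage.
static class CacheMath
{
    // Assumed cache line size in bytes; 64 is the figure used in this article.
    public const int CacheLineBytes = 64;

    // How many whole structs of the given size fit in one cache line.
    public static int StructsPerLine(int structBytes)
    {
        return CacheLineBytes / structBytes;
    }

    // Minimum number of cache lines that must be fetched to walk an array of
    // `count` structs, assuming the array starts on a cache line boundary.
    public static long LinesTouched(long count, int structBytes)
    {
        long totalBytes = count * structBytes;
        return (totalBytes + CacheLineBytes - 1) / CacheLineBytes; // round up
    }
}
```

For the 10,000,000-element arrays in the test, LinesTouched(10000000, 64) gives 10,000,000 cache lines for PlayerExtras while LinesTouched(10000000, 32) gives 5,000,000 for PlayerBasics, so the smaller struct pulls half as much memory through the cache.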

As this page neatly shows, using the CPU cache for half of the elements of the array means we’re only waiting 1-4 nanoseconds for that memory instead of the roughly 100 nanoseconds it takes to fetch it from RAM. That 25-100x difference in latency is what makes the loop over the “basics” version of the struct so much faster.

In general, try to keep the size of structs small if they need to be efficiently iterated over. This improves performance without changing the code that uses the structs, simply by making more efficient use of the CPU cache.
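
If the game still needs the other fields, one possible way to keep the loop fast is to store them in a second array at the same index. The following is only a sketch of that idea under some assumptions: SimVec3 stands in for Vector3 so the code compiles outside of Unity, and HotFields, ColdFields, and PlayerStore are invented names, not anything from Unity.

```csharp
// Hypothetical stand-in for UnityEngine.Vector3: three 4-byte floats.
struct SimVec3
{
    public float X, Y, Z;
}

// Fields the movement loop walks over every frame (32 bytes).
struct HotFields
{
    public SimVec3 Position;
    public SimVec3 Velocity;
    public int Health;
    public int MaxHealth;
}

// Remaining fields, kept in a parallel array at the same index (32 bytes).
struct ColdFields
{
    public int NumLives;
    public int Score;
    public int TeamId;
    public int LeftHandWeaponId;
    public int RightHandWeaponId;
    public int NumWins;
    public int NumLosses;
    public int MatchmakingRank;
}

// Hypothetical container: player i is described by Hot[i] and Cold[i].
class PlayerStore
{
    public readonly HotFields[] Hot;
    public readonly ColdFields[] Cold;

    public PlayerStore(int count)
    {
        Hot = new HotFields[count];
        Cold = new ColdFields[count];
    }

    // Reads and writes only the Hot array, so the Cold fields are never
    // pulled into cache by this loop.
    public void Move(float time)
    {
        for (int i = 0; i < Hot.Length; ++i)
        {
            ref HotFields player = ref Hot[i];
            player.Position.X += player.Velocity.X * time;
            player.Position.Y += player.Velocity.Y * time;
            player.Position.Z += player.Velocity.Z * time;
        }
    }
}
```

The trade-off in this layout is that code needing both the hot and the cold fields of a player has to index two arrays, so it suits cases where the extra fields are read much less often than the position and velocity.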