Multi-threading is essential to performance on all modern processors. Using multiple threads brings along with it the challenge of synchronizing data access across those threads. Unity’s job system can do some of this for us, but it certainly doesn’t handle every case. For times when it doesn’t, C# provides us with a bunch of synchronization options. Which are fastest? Today we’ll find out!

Setting a Baseline

Today we’ll be comparing various synchronization mechanisms. To make sense of their performance, we first need to set a baseline: a test that doesn’t do any synchronization at all. Here’s how it looks:

class TestScript : MonoBehaviour
{
    volatile int Value;
    const int reps = 1000000;
 
    void NoSync()
    {
        for (int i = 0; i < reps; ++i)
        {
            Value++;
        }
    }
}

The Value variable is the data that our threads will contend for. In a real game, this would be a much larger and more complicated set of data. For the purposes of today’s test, it’s just a single integer. As always, make sure to test with realistic data for the most accurate results.

Note that Value is volatile. This defeats various compiler optimizations such as replacing the loop with Value += reps. We want to keep the loop so that we can repeatedly access Value and generate enough work to be measurable in the performance test.
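To see what’s at stake, here’s a sketch of the kind of transformation an optimizer could make if Value weren’t volatile. The method name is hypothetical and this isn’t actual generated code, just an illustration of why the loop could disappear:

// Hypothetical result of optimizing a non-volatile version of NoSync:
// the whole loop collapses into a single store, leaving nothing for the
// performance test to measure.
void NoSyncCollapsed()
{
    Value += reps;
}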

Speaking of the performance test, we now need to add a little bit of framework to run NoSync:

class TestScript : MonoBehaviour
{
    volatile int Value;
    const int reps = 1000000;
 
    void Start()
    {
        StringBuilder report = new StringBuilder(4096);
        report.AppendLine("Test,IsCorrect,One Thread Ticks,Two Thread Ticks");
        Stopwatch sw = Stopwatch.StartNew();
        Test(sw, report, "No Sync", NoSync);
        print(report);
    }
 
    void Test(Stopwatch sw, StringBuilder report, string label, ThreadStart t)
    {
        Thread threadA = new Thread(t);
        Value = 0;
        sw.Restart();
        threadA.Start();
        threadA.Join();
        long oneThreadTicks = sw.ElapsedTicks;
 
        Thread threadB = new Thread(t);
        Thread threadC = new Thread(t);
        Value = 0;
        sw.Restart();
        threadB.Start();
        threadC.Start();
        threadB.Join();
        threadC.Join();
        long twoThreadTicks = sw.ElapsedTicks;
 
        report
            .Append(label).Append(',')
            .Append(Value == reps*2).Append(',')
            .Append(oneThreadTicks).Append(',')
            .Append(twoThreadTicks).Append('\n');
    }
 
    void NoSync()
    {
        for (int i = 0; i < reps; ++i)
        {
            Value++;
        }
    }
}

Start calls Test with each kind of synchronization. Currently, that’s just NoSync but we’ll soon be adding more test functions.

Test runs the test function as a thread in two ways. First, it runs it on the only thread accessing Value. Second, it runs it on two threads so there’s some contention between them for Value. It then adds a row to the CSV report.
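The report records raw Stopwatch ticks. If milliseconds are easier to read, they can be derived from Stopwatch.Frequency. This little helper is just a convenience sketch and isn’t part of the measured test:

// Converts Stopwatch ticks to milliseconds. Stopwatch.Frequency is the
// number of Stopwatch ticks per second on the current machine.
static double TicksToMilliseconds(long ticks)
{
    return ticks * 1000.0 / Stopwatch.Frequency;
}

For example, TicksToMilliseconds(oneThreadTicks) could be appended to the report in place of the raw tick count.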

Lock Statements

Perhaps the simplest form of synchronization is built right into C#: lock statements. Here’s how it looks:

readonly object lockObject = new object();
 
void lockStatement()
{
    for (int i = 0; i < reps; ++i)
    {
        lock (lockObject)
        {
            Value++;
        }
    }
}

All we needed to add was a lockObject, which can be any reference type. A readonly plain object is a good choice so it doesn’t accidentally get reassigned or used for anything else.

Then we just add lock (lockObject) around the access to the data being shared between threads: Value++. This compiles to something like this code:

bool __lockWasTaken = false;
try
{
    System.Threading.Monitor.Enter(lockObject, ref __lockWasTaken);
    Value++;
}
finally
{
    if (__lockWasTaken)
    {
        System.Threading.Monitor.Exit(lockObject);
    }
}

The lock syntax saves us a lot of typing, similar to other language features like foreach. We also get exception safety out of its finally block.
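That exception safety is worth a quick illustration. If the guarded code throws, the compiler-generated finally block still calls Monitor.Exit, so other threads aren’t left blocked forever. Here’s a minimal sketch; lockWithException is purely illustrative and not part of the test:

void lockWithException()
{
    try
    {
        lock (lockObject)
        {
            // Simulate the guarded code failing partway through.
            throw new System.InvalidOperationException("simulated failure");
        }
    }
    catch (System.InvalidOperationException)
    {
        // By the time we get here, Monitor.Exit has already run in the
        // generated finally block, so the lock is free for other threads.
    }
}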

Finally, we add this to the test by just adding a line to Start:

Test(sw, report, "Lock Statement", lockStatement);
Mutex

Next up is the Mutex type:

Mutex mutex = new Mutex();
 
void mutexWaitOne()
{
    for (int i = 0; i < reps; ++i)
    {
        mutex.WaitOne();
        Value++;
        mutex.ReleaseMutex();
    }
}

WaitOne blocks until the mutex is acquired, after which the thread has exclusive access to Value and is free to modify it however it likes until we call ReleaseMutex. We just do the usual Value++.
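Note that nothing here releases the mutex if the guarded code throws; unlike a lock statement, there’s no built-in finally. If exception safety matters for your case, a try/finally does the job. Here’s a sketch of such a variant, kept out of the timing test so the loop bodies stay directly comparable:

void mutexWaitOneSafe()
{
    for (int i = 0; i < reps; ++i)
    {
        mutex.WaitOne();
        try
        {
            Value++;
        }
        finally
        {
            // Runs even if the guarded code throws.
            mutex.ReleaseMutex();
        }
    }
}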

Next, we can add this to the test with another line in Start:

Test(sw, report, "Mutex", mutexWaitOne);
Interlocked

Next up is the Interlocked class and its static CompareExchange method:

void compareExchange()
{
    for (int i = 0; i < reps; ++i)
    {
        while (true)
        {
            int init = Value;
            int sum = init + 1;
            if (Interlocked.CompareExchange(ref Value, sum, init) == init)
            {
                break;
            }
        }
    }
}

This technique is a little trickier to implement. We write an infinite loop that tries over and over to increment Value. It comes in a few parts.

First, we fetch Value from memory and store it in a local init variable.

Second, we modify init however we need to. In this case we just add one to get sum.

Third, we call CompareExchange with three parameters:

  1. ref Value: The memory address we want to change
  2. sum: The value we want to write
  3. init: The value to compare the first parameter (ref Value) to

CompareExchange reads the value at the memory address (ref Value) and compares it to the third parameter (init). If they’re equal, it writes the second parameter (sum) to that address. Either way, it returns the value that was at the address when the call started. All of this happens atomically, so no other thread can change the value at that address partway through these steps.
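The same snapshot-transform-publish pattern works for any change to the value, not just adding one. Here’s a sketch of a generic helper built on that idea; AtomicUpdate is a hypothetical name, not a .NET API:

// Repeatedly snapshot the target, compute a new value from the snapshot,
// and publish it only if no other thread changed the target in between.
static void AtomicUpdate(ref int target, System.Func<int, int> transform)
{
    while (true)
    {
        int init = target;             // 1. fetch the current value
        int updated = transform(init); // 2. compute the new value
        // 3. write only if the target still holds the value we read
        if (Interlocked.CompareExchange(ref target, updated, init) == init)
        {
            break;
        }
    }
}

Calling AtomicUpdate(ref Value, v => v + 1) would reproduce the increment above.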

Why does this prevent corruption of Value in our case? Well, the problem with NoSync that we’re trying to avoid is a thread execution ordering like this:

Value = 100; // Initial state of the memory at Value
 
int aValue = Value; // Thread A reads the memory at Value: 100
int bValue = Value; // Thread B reads the memory at Value: 100
 
aValue++; // Thread A increments the value it read to 101
bValue++; // Thread B increments the value it read to 101
 
Value = aValue; // Thread A writes the new value (101) to the memory at Value
Value = bValue; // Thread B writes the new value (101) to the memory at Value
 
Value == 101; // Incorrect! Should be 102, but is 101

CompareExchange inserts a check to make sure the memory at Value is still the value it started with (100) before writing the updated value (aValue or bValue) to the memory at Value. The same thread execution ordering would look like this:

Value = 100; // Initial state of the memory at Value
 
int aValue = Value; // Thread A reads the memory at Value: 100
int bValue = Value; // Thread B reads the memory at Value: 100
 
int aNewValue = aValue + 1; // Thread A increments the value it read to 101
int bNewValue = bValue + 1; // Thread B increments the value it read to 101
 
// If the memory at Value is unchanged, which is TRUE
if (Value == aValue)
{
    // Thread A writes the new value (101) to the memory at Value
    Value = aNewValue;
}
 
// If the memory at Value is unchanged, which is FALSE
if (Value == bValue)
{
    // Thread B writes the new value (101) to the memory at Value
    // This is NOT executed
    Value = bNewValue;
}
else
{
    int bValue = Value; // Thread B reads the memory at Value: 101
    int bNewValue = bValue + 1; // Thread B increments the value it read to 102
 
    // If the memory at Value is unchanged, which is TRUE
    if (Value == bValue)
    {
        // Thread B writes the new value (102) to the memory at Value
        Value = bNewValue;
    }
}
 
Value == 102; // Correct! Thread B wrote 102 after retrying.

The above pseudo-code shows the infinite loop broken down into just the if and else cases that would execute in this hypothetical thread execution ordering.

The critical part is that Thread B detects that Thread A changed the memory at Value and starts all over. It re-fetches the memory at Value, adds one, checks again to make sure the memory at Value is what it started with this time, and finally writes its result now that the memory hasn’t changed. The result is now correct!
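Since our transformation is just adding one, it’s worth knowing that Interlocked also has dedicated methods for the most common operations. Interlocked.Increment performs the whole read-add-write atomically in a single call. It’s not one of the measured tests, but a version of the loop using it would look something like this:

void interlockedIncrement()
{
    for (int i = 0; i < reps; ++i)
    {
        // Atomic equivalent of Value++, no retry loop required.
        Interlocked.Increment(ref Value);
    }
}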

Finally, let’s add this to the test with a line in Start:

Test(sw, report, "CompareExchange", compareExchange);
SpinWait

The CompareExchange approach contains a tight loop that retries over and over until successful. This means that the thread will utilize its CPU core 100% until it successfully writes to Value. If there is a lot of contention, it might be good to back off and let the OS perform a context switch so another thread can use the CPU core. To do this, we’ll introduce a SpinWait:

void compareExchangeSpinOnce()
{
    for (int i = 0; i < reps; ++i)
    {
        SpinWait spin = new SpinWait();
        while (true)
        {
            int init = Value;
            int sum = init + 1;
            if (Interlocked.CompareExchange(ref Value, sum, init) == init)
            {
                break;
            }
            spin.SpinOnce();
        }
    }
}

The only difference here is that we call SpinOnce to wait a few cycles before retrying. This gives the retry a better chance of succeeding. It also allows a CPU context switch if we’ve been waiting too long, or when running on a single-core CPU where spinning is pointless.
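SpinWait escalates on its own: the first few SpinOnce calls just burn a handful of cycles, while later calls yield the thread to the OS scheduler. Its NextSpinWillYield property exposes that behavior. Here’s a sketch that tracks how often an increment had to give up its core; it’s illustrative only and not one of the measured tests:

void compareExchangeSpinVerbose()
{
    SpinWait spin = new SpinWait();
    int yields = 0;
    while (true)
    {
        int init = Value;
        if (Interlocked.CompareExchange(ref Value, init + 1, init) == init)
        {
            break;
        }
        if (spin.NextSpinWillYield)
        {
            // The next SpinOnce will hand the core to the OS scheduler
            // instead of busy-waiting.
            yields++;
        }
        spin.SpinOnce();
    }
    // 'yields' now holds how many times this increment gave up its core.
}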

Let’s add this test to Start:

Test(sw, report, "CompareExchange+SpinOnce", compareExchangeSpinOnce);
Running the Test

With all of our test functions written, let’s check out the final test script:

using System.Diagnostics;
using System.Text;
using System.Threading;
using UnityEngine;
 
class TestScript : MonoBehaviour
{
    volatile int Value;
    readonly object lockObject = new object();
    Mutex mutex = new Mutex();
    const int reps = 1000000;
 
    void Start()
    {
        StringBuilder report = new StringBuilder(4096);
        report.AppendLine("Test,IsCorrect,One Thread Ticks,Two Thread Ticks");
        Stopwatch sw = Stopwatch.StartNew();
        Test(sw, report, "No Sync", NoSync);
        Test(sw, report, "Lock Statement", lockStatement);
        Test(sw, report, "Mutex", mutexWaitOne);
        Test(sw, report, "CompareExchange", compareExchange);
        Test(sw, report, "CompareExchange+SpinOnce", compareExchangeSpinOnce);
        print(report);
    }
 
    void Test(Stopwatch sw, StringBuilder report, string label, ThreadStart t)
    {
        Thread threadA = new Thread(t);
        Value = 0;
        sw.Restart();
        threadA.Start();
        threadA.Join();
        long oneThreadTicks = sw.ElapsedTicks;
 
        Thread threadB = new Thread(t);
        Thread threadC = new Thread(t);
        Value = 0;
        sw.Restart();
        threadB.Start();
        threadC.Start();
        threadB.Join();
        threadC.Join();
        long twoThreadTicks = sw.ElapsedTicks;
 
        report
            .Append(label).Append(',')
            .Append(Value == reps*2).Append(',')
            .Append(oneThreadTicks).Append(',')
            .Append(twoThreadTicks).Append('\n');
    }
 
    void NoSync()
    {
        for (int i = 0; i < reps; ++i)
        {
            Value++;
        }
    }
 
    void lockStatement()
    {
        for (int i = 0; i < reps; ++i)
        {
            lock (lockObject)
            {
                Value++;
            }
        }
    }
 
    void mutexWaitOne()
    {
        for (int i = 0; i < reps; ++i)
        {
            mutex.WaitOne();
            Value++;
            mutex.ReleaseMutex();
        }
    }
 
    void compareExchange()
    {
        for (int i = 0; i < reps; ++i)
        {
            while (true)
            {
                int init = Value;
                int sum = init + 1;
                if (Interlocked.CompareExchange(ref Value, sum, init) == init)
                {
                    break;
                }
            }
        }
    }
 
    void compareExchangeSpinOnce()
    {
        for (int i = 0; i < reps; ++i)
        {
            SpinWait spin = new SpinWait();
            while (true)
            {
                int init = Value;
                int sum = init + 1;
                if (Interlocked.CompareExchange(ref Value, sum, init) == init)
                {
                    break;
                }
                spin.SpinOnce();
            }
        }
    }
}

I ran the test in this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.15.4
  • Unity 2019.3.9f1
  • Burst package 1.2.3
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

And here are the results I got:

Test                        IsCorrect   One Thread Ticks   Two Thread Ticks
No Sync                     False       385315             668500
Lock Statement              True        1846390            8938287
Mutex                       True        2719061            14309562
CompareExchange             True        401456             1462565
CompareExchange+SpinOnce    True        402419             802026

Thread Synchronization Performance Graph (raw)

Thread Synchronization Performance Graph (increase over baseline)

Not surprisingly, the baseline NoSync function was the fastest regardless of the number of threads being run as it performs no synchronization whatsoever. Also not surprisingly, it produces an incorrect result for Value. Even in a small test such as this, it shows the need for synchronization of some kind.

lock statements were far more expensive than the baseline. The time taken was over 4x for one thread (i.e. no contention) and over 12x for two threads. NoSync’s time increased 1.73x when the second thread was added, but lock statements’ time increased 4.84x. That’s the cost they paid to synchronize under actual contention.

Mutex was even slower: over 7x more time than NoSync for one thread and over 20x more time for two threads. The jump from one thread to two was 5.26x, even more of a penalty than with lock statements.

CompareExchange was much quicker. With one thread, it took only 4% more time than NoSync. With two threads it took 2.19x as long as NoSync’s two-thread run, though, a far bigger gap than with one thread. The increase from one thread to two was 3.6x, the smallest of the synchronization mechanisms so far.

CompareExchange+SpinOnce took only slightly more time with one thread: still about 4% longer than NoSync. The two-thread time was much reduced compared to plain CompareExchange: only 1.2x longer than NoSync. The time increase when going to two threads was only 2x, well below CompareExchange’s 3.6x (the next best) and close to the 1.73x that NoSync scored.

Conclusion

Synchronization options are many and varied. They each come with their own complexity, safety, and performance characteristics. Which is best? No one performance test, including this one, can say definitively. The amount of work being done will surely vary from just an increment and affect things like the cost of retrying with CompareExchange. The data being accessed will likewise vary from just a single int. Exception safety may be needed, but is only tested here with lock statements. The usability of each is important too as thread synchronization is notoriously difficult and often the source of bugs.

Still, this test shows that CompareExchange with a SpinWait has the potential for great performance. It’s surely worth investigating along with the other options. The decision will ultimately need to be made on a case-by-case basis, but knowing these various techniques will inform that decision. Knowing more options means there’s a better chance that the best solution will be found. Speaking of more options, do you have any to share? If so, let me know in the comments!