Writing multi-threaded code is one of the keys to maximizing performance. Currently, this means creating your own threads and synchronizing them with C# keywords like lock and volatile as well as .NET APIs like the [ThreadStatic] attribute and the Interlocked class. Today we’ll take a look at how IL2CPP implements these behind the scenes to get some understanding of what we’re really telling the computer to do when we use them.

lock

The lock keyword gives a thread exclusive access to some resource. When the thread enters the block it acquires the lock, and when it leaves the block the lock is released. Here’s a trivial example:

public static class TestClass
{
	public static int Lock(object o, int i)
	{
		lock (o)
		{
			return i;
		}
	}
}

Now here’s the C++ code that IL2CPP generates for this function in Unity 2017.4:

extern "C"  int32_t TestClass_Lock_m2756077292 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___o0, int32_t ___i1, const RuntimeMethod* method)
{
	RuntimeObject * V_0 = NULL;
	int32_t V_1 = 0;
	Exception_t * __last_unhandled_exception = 0;
	NO_UNUSED_WARNING (__last_unhandled_exception);
	Exception_t * __exception_local = 0;
	NO_UNUSED_WARNING (__exception_local);
	int32_t __leave_target = 0;
	NO_UNUSED_WARNING (__leave_target);
	{
		RuntimeObject * L_0 = ___o0;
		V_0 = L_0;
		RuntimeObject * L_1 = V_0;
		Monitor_Enter_m2249409497(NULL /*static, unused*/, L_1, /*hidden argument*/NULL);
	}
 
IL_0008:
	try
	{ // begin try (depth: 1)
		int32_t L_2 = ___i1;
		V_1 = L_2;
		IL2CPP_LEAVE(0x16, FINALLY_000f);
	} // end try (depth: 1)
	catch(Il2CppExceptionWrapper& e)
	{
		__last_unhandled_exception = (Exception_t *)e.ex;
		goto FINALLY_000f;
	}
 
FINALLY_000f:
	{ // begin finally (depth: 1)
		RuntimeObject * L_3 = V_0;
		Monitor_Exit_m3585316909(NULL /*static, unused*/, L_3, /*hidden argument*/NULL);
		IL2CPP_END_FINALLY(15)
	} // end finally (depth: 1)
	IL2CPP_CLEANUP(15)
	{
		IL2CPP_JUMP_TBL(0x16, IL_0016)
		IL2CPP_RETHROW_IF_UNHANDLED(Exception_t *)
	}
 
IL_0016:
	{
		int32_t L_4 = V_1;
		return L_4;
	}
}

The vast majority of this exists to implement the try-finally that lock implies. It guarantees that if an exception is thrown inside the lock block, the lock on the resource is still released. In this case there’s no way that return i can actually throw an exception, so all of this exception handling is pointless. It’s worth noting that this try-finally does not incur the method initialization overhead we normally get when using exceptions.
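
In C# terms, the generated code behaves roughly like this hand-expanded version. This is a sketch, not compiler output; newer C# compilers actually emit the Monitor.Enter(object, ref bool lockTaken) overload, but the shape is the same:

using System.Threading;

public static class TestClass
{
    public static int Lock(object o, int i)
    {
        Monitor.Enter(o);
        try
        {
            return i;
        }
        finally
        {
            Monitor.Exit(o);
        }
    }
}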

Let’s see if the C++ compiler was able to figure out that all of this exception-related code isn’t necessary by looking at how Xcode 9.2 compiles the C++ to ARM assembly:

	push	{r4, r5, r7, lr}
	add	r7, sp, #8
	mov	r4, r2
	movs	r0, #0
	movs	r2, #0
	mov	r5, r1
	bl	_Monitor_Enter_m2249409497
	movs	r0, #0
	mov	r1, r5
	movs	r2, #0
	bl	_Monitor_Exit_m3585316909
	mov	r0, r4
	pop	{r4, r5, r7, pc}

All of that messy C++ has boiled down to basically just a couple of calls to Monitor.Enter and Monitor.Exit. These are the calls that perform the actual locking and unlocking, meaning that lock is really just syntax sugar for these calls plus the try-finally. So let’s look at how they’re implemented:

extern "C"  void Monitor_Enter_m2249409497 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	typedef void (*Monitor_Enter_m2249409497_ftn) (RuntimeObject *);
	using namespace il2cpp::icalls;
	 ((Monitor_Enter_m2249409497_ftn)mscorlib::System::Threading::Monitor::Enter) (___obj0);
}
 
extern "C"  void Monitor_Exit_m3585316909 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	typedef void (*Monitor_Exit_m3585316909_ftn) (RuntimeObject *);
	using namespace il2cpp::icalls;
	 ((Monitor_Exit_m3585316909_ftn)mscorlib::System::Threading::Monitor::Exit) (___obj0);
}

These are essentially just stubs that call into the IL2CPP runtime library. Its source code ships in the Unity installation directory under the name libil2cpp. Opening up icalls/mscorlib/System.Threading/Monitor.cpp, we see this:

void Monitor::Enter(Il2CppObject* obj)
{
    IL2CPP_CHECK_ARG_NULL(obj);
    il2cpp::vm::Monitor::Enter(obj);
}
 
void Monitor::Exit(Il2CppObject* obj)
{
    IL2CPP_CHECK_ARG_NULL(obj);
    il2cpp::vm::Monitor::Exit(obj);
}

These are basically just more wrapper functions, so we need to open up libil2cpp/vm/Monitor.cpp to see how they really work:

void Monitor::Enter(Il2CppObject* object)
{
    TryEnter(object, std::numeric_limits<uint32_t>::max());
}
 
bool Monitor::TryEnter(Il2CppObject* obj, uint32_t timeOutMilliseconds)
{
    // NOTE: contents removed
}
 
void Monitor::Exit(Il2CppObject* obj)
{
    // NOTE: contents removed
}

These functions are several hundred lines long, but they’re well commented, so feel free to give them a read to better understand what’s going on. In short, Enter and TryEnter implement Monitor.Enter with a lot of atomic operations, and the same goes for Exit and Monitor.Exit.
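
As a rough illustration of the idea only, not the actual libil2cpp code (which also tracks the owning thread, supports recursion, and blocks on an OS primitive under contention), a minimal spin-based enter/exit built on atomic operations might look like this in C#:

using System.Threading;

public sealed class TinyMonitor
{
    // 0 = free, 1 = held. The real monitor also stores the owner and a recursion count.
    private int _state;

    public void Enter()
    {
        // Atomically flip the state from free to held; retry until we win.
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
        {
            Thread.Yield(); // the real code eventually blocks instead of spinning forever
        }
    }

    public void Exit()
    {
        // A full-fence write publishes everything done inside the lock.
        Interlocked.Exchange(ref _state, 0);
    }
}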

volatile

The C# volatile keyword is documented as indicating that a field may be accessed by multiple threads, so certain optimizations must be disabled and the field’s value must always be up to date. What does this mean in practice with IL2CPP? Let’s write a pair of test functions to find out:

public struct MyStruct
{
	public volatile int VolatileField;
}
 
public static class TestClass
{
	public static int ReadVolatileField(MyStruct x)
	{
		return x.VolatileField;
	}
 
	public static void WriteVolatileField(MyStruct x)
	{
		x.VolatileField = 123;
	}
}

Here’s the C++ that IL2CPP generates:

extern "C"  int32_t TestClass_ReadVolatileField_m4039530565 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int32_t L_0 = (&___x0)->get_VolatileField_0();
		il2cpp_codegen_memory_barrier();
		return L_0;
	}
}
 
extern "C"  void TestClass_WriteVolatileField_m741190780 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		il2cpp_codegen_memory_barrier();
		(&___x0)->set_VolatileField_0(((int32_t)123));
		return;
	}
}

When reading, the volatile field is read and then il2cpp_codegen_memory_barrier is called. Writing is the mirror image: il2cpp_codegen_memory_barrier is called and then the value is written. Let’s dive in to see how the memory barrier is implemented:

inline void il2cpp_codegen_memory_barrier()
{
    il2cpp::vm::Thread::FullMemoryBarrier();
}

This is another wrapper function, so let’s go into the IL2CPP source code like above to find Thread.cpp:

void Thread::FullMemoryBarrier()
{
    os::Atomic::FullMemoryBarrier();
}

This wrapper function calls into another part of IL2CPP, so let’s go to Atomic.h:

#if !IL2CPP_SUPPORT_THREADS
 
namespace il2cpp
{
namespace os
{
    inline void Atomic::FullMemoryBarrier()
    {
        // Do nothing.
    }
}
}
 
#elif IL2CPP_TARGET_WINDOWS
#include "os/Win32/AtomicImpl.h"
#elif IL2CPP_TARGET_PS4
#include "os/AtomicImpl.h"  // has to come earlier than posix
#elif IL2CPP_TARGET_PSP2
#include "os/PSP2/AtomicImpl.h"
#elif IL2CPP_TARGET_POSIX
#include "os/Posix/AtomicImpl.h"
#else
#include "os/AtomicImpl.h"
#endif

Here we see that the implementation of FullMemoryBarrier is platform-specific. Since we’re on POSIX-based iOS, we’ll read os/Posix/AtomicImpl.h:

inline void Atomic::FullMemoryBarrier()
{
    __sync_synchronize();
}

Finally we’ve reached the implementation! The GCC docs describe __sync_synchronize as simply “a full memory barrier,” which means that all reads and writes before the barrier are committed to memory before any reads and writes after it. This is stronger than C#’s volatile semantics strictly require, since an acquire fence on reads and a release fence on writes would suffice, but the difference is unlikely to matter to any C# script because only extremely performance-critical code needs that level of optimization.
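
For comparison, C# exposes those weaker fences directly through System.Threading.Volatile (assuming the .NET 4.x scripting runtime, where that class is available). Here’s a small sketch of what a volatile field access is semantically equivalent to:

using System.Threading;

public static class VolatileEquivalent
{
    private static int s_value;

    public static int Read()
    {
        // Acquire semantics: later reads and writes can't move before this load.
        return Volatile.Read(ref s_value);
    }

    public static void Write(int newValue)
    {
        // Release semantics: earlier reads and writes can't move after this store.
        Volatile.Write(ref s_value, newValue);
    }
}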

[ThreadStatic]

The [ThreadStatic] attribute is applied to static fields that need to be scoped to a single thread rather than shared by all threads in the process. Let’s write a similar pair of test functions to see how it’s implemented:

public struct MyStruct
{
	public volatile int VolatileField;
	[ThreadStatic] public static int ThreadStaticField;
}
 
public static class TestClass
{
	public static int ReadThreadStaticField()
	{
		return MyStruct.ThreadStaticField;
	}
 
	public static void WriteThreadStaticField()
	{
		MyStruct.ThreadStaticField = 123;
	}
}

Here’s the C++ that IL2CPP turns this into:

extern "C"  int32_t TestClass_ReadThreadStaticField_m1391367394 (RuntimeObject * __this /* static, unused */, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_ReadThreadStaticField_m1391367394_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		int32_t L_0 = ((MyStruct_t123831593_ThreadStaticFields*)il2cpp_codegen_get_thread_static_data(MyStruct_t123831593_il2cpp_TypeInfo_var))->get_ThreadStaticField_2();
		return L_0;
	}
}
 
extern "C"  void TestClass_WriteThreadStaticField_m959076367 (RuntimeObject * __this /* static, unused */, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_WriteThreadStaticField_m959076367_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		((MyStruct_t123831593_ThreadStaticFields*)il2cpp_codegen_get_thread_static_data(MyStruct_t123831593_il2cpp_TypeInfo_var))->set_ThreadStaticField_2(((int32_t)123));
		return;
	}
}

Each of these gets the method initialization overhead, as is the case with all other functions that access static fields. Beyond that, there’s a call to il2cpp_codegen_get_thread_static_data followed by get_ThreadStaticField_2 for reading or set_ThreadStaticField_2 for writing. Let’s look at il2cpp_codegen_get_thread_static_data first:

inline void* il2cpp_codegen_get_thread_static_data(RuntimeClass* klass)
{
    return il2cpp::vm::Thread::GetThreadStaticData(klass->thread_static_fields_offset);
}

This wrapper function calls into libil2cpp, so let’s go there and see how it’s implemented:

void* Thread::GetThreadStaticData(int32_t offset)
{
    // No lock. We allocate static_data once with a fixed size so we can read it
    // safely without a lock here.
    IL2CPP_ASSERT(offset >= 0 && static_cast<uint32_t>(offset) < s_ThreadStaticSizes.size());
    return Current()->GetInternalThread()->static_data[offset];
}

The data is simply kept in an array associated with the Thread object. Let’s look at each part of this, starting with Current:

Il2CppThread* Thread::Current()
{
    void* value = NULL;
    s_CurrentThread.GetValue(&value);
    return (Il2CppThread*)value;
}

The current thread is looked up from s_CurrentThread, a static variable that holds a per-thread value. Next there’s a call to GetInternalThread:

#if !NET_4_0
Il2CppThread* GetInternalThread()
{
    return this;
}
 
#else
Il2CppInternalThread* GetInternalThread() const
{
    return internal_thread;
}
 
#endif

This differs depending on the scripting runtime version: .NET 4.x delegates to an Il2CppInternalThread, but both versions have a static_data field.

Jumping back to il2cpp_codegen_get_thread_static_data, we see that the generated code casts its return value to a MyStruct_t123831593_ThreadStaticFields*, a struct that holds all of the type’s [ThreadStatic] fields:

struct MyStruct_t123831593_ThreadStaticFields
{
public:
	// System.Int32 MyStruct::ThreadStaticField
	int32_t ___ThreadStaticField_2;
 
public:
	inline static int32_t get_offset_of_ThreadStaticField_2() { return static_cast<int32_t>(offsetof(MyStruct_t123831593_ThreadStaticFields, ___ThreadStaticField_2)); }
	inline int32_t get_ThreadStaticField_2() const { return ___ThreadStaticField_2; }
	inline int32_t* get_address_of_ThreadStaticField_2() { return &___ThreadStaticField_2; }
	inline void set_ThreadStaticField_2(int32_t value)
	{
		___ThreadStaticField_2 = value;
	}
};

In this case there’s just one field and a bunch of accessors for it. So when get_ThreadStaticField_2 or set_ThreadStaticField_2 is called, it just trivially gets or sets the field.
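
Here’s a quick hypothetical sketch of what that per-thread storage means in practice: each thread indexes into its own static_data array, so writes on one thread never show up on another.

using System.Threading;

public static class ThreadStaticDemo
{
    [ThreadStatic] public static int ThreadStaticField;

    public static int Run()
    {
        var worker = new Thread(() =>
        {
            // This increments the worker thread's copy of the field.
            for (int i = 0; i < 1000; i++)
            {
                ThreadStaticField++;
            }
        });
        worker.Start();
        worker.Join();

        // The main thread's copy was never touched, so this returns 0.
        return ThreadStaticField;
    }
}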

Interlocked

The Interlocked class in .NET has a lot of static functions to help synchronize multi-threaded code. These include a variety of atomic operations that serve as alternatives to volatile and lock. Let’s try some out:

public struct MyStruct
{
	public volatile int VolatileField;
	public long NonVolatileField;
	[ThreadStatic] public static int ThreadStaticField;
}
 
public static class TestClass
{
	public static void InterlockedCompareExchange(MyStruct x)
	{
		Interlocked.CompareExchange(ref x.NonVolatileField, 123, 456);
	}
 
	public static void InterlockedExchange(MyStruct x)
	{
		Interlocked.Exchange(ref x.NonVolatileField, 123);
	}
 
	public static void InterlockedAdd(MyStruct x)
	{
		Interlocked.Add(ref x.NonVolatileField, 123);
	}
 
	public static void InterlockedIncrement(MyStruct x)
	{
		Interlocked.Increment(ref x.NonVolatileField);
	}
 
	public static long InterlockedRead(MyStruct x)
	{
		return Interlocked.Read(ref x.NonVolatileField);
	}
}

Let’s see the IL2CPP output for these:

extern "C"  void TestClass_InterlockedCompareExchange_m2663092245 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
		Interlocked_CompareExchange_m1385746522(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), (((int64_t)((int64_t)((int32_t)456)))), /*hidden argument*/NULL);
		return;
	}
}
 
extern "C"  void TestClass_InterlockedExchange_m3541638669 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
		Interlocked_Exchange_m3049791109(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), /*hidden argument*/NULL);
		return;
	}
}
 
extern "C"  void TestClass_InterlockedAdd_m51470974 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
		Interlocked_Add_m802687858(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), /*hidden argument*/NULL);
		return;
	}
}
 
extern "C"  void TestClass_InterlockedIncrement_m2224055823 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
		Interlocked_Increment_m1565533900(NULL /*static, unused*/, L_0, /*hidden argument*/NULL);
		return;
	}
}
 
extern "C"  int64_t TestClass_InterlockedRead_m3203030973 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593  ___x0, const RuntimeMethod* method)
{
	{
		int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
		int64_t L_1 = Interlocked_Read_m673992094(NULL /*static, unused*/, L_0, /*hidden argument*/NULL);
		return L_1;
	}
}

All of these are just calls into their respective Interlocked functions, so let’s look at those:

extern "C"  int64_t Interlocked_CompareExchange_m1385746522 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, int64_t ___comparand2, const RuntimeMethod* method)
{
	typedef int64_t (*Interlocked_CompareExchange_m1385746522_ftn) (int64_t*, int64_t, int64_t);
	using namespace il2cpp::icalls;
	return  ((Interlocked_CompareExchange_m1385746522_ftn)mscorlib::System::Threading::Interlocked::CompareExchange64) (___location10, ___value1, ___comparand2);
}
 
extern "C"  int64_t Interlocked_Exchange_m3049791109 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, const RuntimeMethod* method)
{
	typedef int64_t (*Interlocked_Exchange_m3049791109_ftn) (int64_t*, int64_t);
	using namespace il2cpp::icalls;
	return  ((Interlocked_Exchange_m3049791109_ftn)mscorlib::System::Threading::Interlocked::Exchange64) (___location10, ___value1);
}
 
extern "C"  int64_t Interlocked_Add_m802687858 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, const RuntimeMethod* method)
{
	typedef int64_t (*Interlocked_Add_m802687858_ftn) (int64_t*, int64_t);
	using namespace il2cpp::icalls;
	return  ((Interlocked_Add_m802687858_ftn)mscorlib::System::Threading::Interlocked::Add64) (___location10, ___value1);
}
 
extern "C"  int64_t Interlocked_Increment_m1565533900 (RuntimeObject * __this /* static, unused */, int64_t* ___location0, const RuntimeMethod* method)
{
	typedef int64_t (*Interlocked_Increment_m1565533900_ftn) (int64_t*);
	using namespace il2cpp::icalls;
	return  ((Interlocked_Increment_m1565533900_ftn)mscorlib::System::Threading::Interlocked::Increment64) (___location0);
}
 
extern "C"  int64_t Interlocked_Read_m673992094 (RuntimeObject * __this /* static, unused */, int64_t* ___location0, const RuntimeMethod* method)
{
	typedef int64_t (*Interlocked_Read_m673992094_ftn) (int64_t*);
	using namespace il2cpp::icalls;
	return  ((Interlocked_Read_m673992094_ftn)mscorlib::System::Threading::Interlocked::Read) (___location0);
}

Just like above, these are wrapper functions that call into libil2cpp. Let’s go through Interlocked.cpp starting with CompareExchange:

int64_t Interlocked::CompareExchange64(int64_t* location, int64_t value, int64_t comparand)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::CompareExchange64(location, value, comparand);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    int64_t orig = *location;
    if (*location == comparand)
        *location = value;
 
    return orig;
#endif
}

Here we see that the function is implemented in two different ways depending on whether IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT is set. What determines that? To find out, let’s look at il2cpp-config.h:

// 64-bit types are aligned to 8 bytes on 64-bit platforms and always on Windows
#define IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT ((IL2CPP_SIZEOF_VOID_P == 8) || (IL2CPP_TARGET_WINDOWS))

So IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT is set on 64-bit platforms and on Windows. This means that CompareExchange64 is implemented just with a call to Atomic::CompareExchange64 on those platforms, which are presumably the most common nowadays. Let’s look at Atomic.h to see Atomic::CompareExchange64:

static inline int64_t CompareExchange64(volatile int64_t* dest, int64_t exchange, int64_t comparand)
{
    return UnityPalCompareExchange64(dest, exchange, comparand);
}

This is yet another wrapper function, so we need to find UnityPalCompareExchange64 in Atomic-c-api.h:

#elif IL2CPP_TARGET_WINDOWS
#include "Win32/AtomicImpl-c-api.h"
#elif IL2CPP_TARGET_PS4
#include "PS4/AtomicImpl-c-api.h"  // has to come earlier than posix
#elif IL2CPP_TARGET_PSP2
#include "PSP2/AtomicImpl-c-api.h"
#elif IL2CPP_TARGET_POSIX
#include "Posix/AtomicImpl-c-api.h"
#else
#include "AtomicImpl-c-api.h"
#endif

This is another area of platform-specific code, so we’ll follow to Posix/AtomicImpl-c-api.h to find the iOS version:

inline int64_t UnityPalCompareExchange64(volatile int64_t* dest, int64_t exchange, int64_t comparand)
{
    ASSERT_ALIGNMENT(dest, 8);
#ifdef __EMSCRIPTEN__
    return emscripten_atomic_cas_u64((void*)dest, comparand, exchange) == comparand ? comparand : *dest;
#else
    return __sync_val_compare_and_swap(dest, comparand, exchange);
#endif
}

Finally, we’ve found the implementation! There’s an Emscripten-specific version for WebGL builds, but for iOS we’re using another GCC builtin, which the docs describe like this:

These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr.

The “bool” version returns true if the comparison is successful and newval was written. The “val” version returns the contents of *ptr before the operation.

This means that on 64-bit platforms and on Windows, Interlocked.CompareExchange calls a series of wrapper functions that end in a single atomic compare-and-swap instruction. Otherwise, the atomic operation is replaced with a FastAutoLock and an if branch. To find out how FastAutoLock works, we go to Mutex.h:

class FastMutexImpl;
 
class FastMutex
{
public:
    FastMutex();
    ~FastMutex();
 
    void Lock();
    void Unlock();
 
    FastMutexImpl* GetImpl();
 
private:
    FastMutexImpl* m_Impl;
};
 
struct FastAutoLock : public il2cpp::utils::NonCopyable
{
    FastAutoLock(FastMutex* mutex)
        : m_Mutex(mutex)
    {
        m_Mutex->Lock();
    }
 
    ~FastAutoLock()
    {
        m_Mutex->Unlock();
    }
 
private:
    FastMutex* m_Mutex;
};

So FastAutoLock really just wraps a FastMutex, which in turn wraps a FastMutexImpl, which is another platform-specific part of the code. Let’s look at the POSIX version that’s used on iOS:

class FastMutexImpl
{
public:
 
    FastMutexImpl()
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&m_Mutex, &attr);
        pthread_mutexattr_destroy(&attr);
    }
 
    ~FastMutexImpl()
    {
        pthread_mutex_destroy(&m_Mutex);
    }
 
    void Lock()
    {
        pthread_mutex_lock(&m_Mutex);
    }
 
    void Unlock()
    {
        pthread_mutex_unlock(&m_Mutex);
    }
 
    pthread_mutex_t* GetOSHandle()
    {
        return &m_Mutex;
    }
 
private:
    pthread_mutex_t m_Mutex;
};

The POSIX version is, unsurprisingly, just a wrapper around calls into the POSIX Threads (a.k.a. pthreads) library. This means that FastAutoLock acquires a full mutex in order to implement Interlocked.CompareExchange, at least for the long overload. The int overload always uses an atomic operation instead of a lock:

int32_t Interlocked::CompareExchange(int32_t* location, int32_t value, int32_t comparand)
{
    return Atomic::CompareExchange(location, value, comparand);
}
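
Whichever path the runtime takes underneath, the call from C# looks the same. One common use, shown here as a hypothetical sketch, is claim-once initialization, where only the thread that atomically swaps the flag from 0 to 1 runs the setup:

using System.Threading;

public static class OneTimeInit
{
    private static int s_initialized; // 0 = not yet, 1 = claimed

    public static void EnsureInitialized()
    {
        // CompareExchange returns the original value, so only one caller ever sees 0.
        if (Interlocked.CompareExchange(ref s_initialized, 1, 0) == 0)
        {
            // ... perform the one-time setup here ...
        }
    }
}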

Next let’s look at Interlocked.Exchange:

int64_t Interlocked::Exchange64(int64_t* location1, int64_t value)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Exchange64(location1, value);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    int64_t orig = *location1;
    *location1 = value;
    return orig;
#endif
}

This is the same pattern as before. The locking version is clear, so let’s look at what the Atomic::Exchange64 version ends up calling:

inline int64_t UnityPalExchange64(volatile int64_t* dest, int64_t exchange)
{
    ASSERT_ALIGNMENT(dest, 8);
#ifdef __EMSCRIPTEN__
    return emscripten_atomic_exchange_u64((void*)dest, exchange);
#else
    int64_t prev;
    do
    {
        prev = *dest;
    }
    while (!__sync_bool_compare_and_swap(dest, prev, exchange));
    return prev;
#endif
}

This is similar to UnityPalCompareExchange64, except that it must loop: __sync_bool_compare_and_swap can fail if another thread changes *dest between the read and the swap, so the exchange retries until it wins the race.
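
The same retry pattern shows up in C# whenever a compound atomic operation is built out of Interlocked.CompareExchange. For example, a hypothetical lock-free “atomic maximum” helper (not part of the article’s test code) loops until its compare-and-swap wins:

using System.Threading;

public static class AtomicMath
{
    public static void Max(ref long location, long candidate)
    {
        long observed = Interlocked.Read(ref location);
        while (candidate > observed)
        {
            // Swap in the candidate only if nothing changed since we read it;
            // otherwise take the fresher value and try again.
            long original = Interlocked.CompareExchange(ref location, candidate, observed);
            if (original == observed)
            {
                return; // our compare-and-swap won
            }
            observed = original;
        }
    }
}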

Now let’s look at Interlocked.Add:

int64_t Interlocked::Add64(int64_t* location1, int64_t value)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Add64(location1, value);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return *location1 += value;
#endif
}

Again we see the same pattern, so let’s just look at the atomic version:

inline int64_t UnityPalAdd64(volatile int64_t* location1, int64_t value)
{
    ASSERT_ALIGNMENT(location1, 8);
    return __sync_add_and_fetch(location1, value);
}

The GCC docs describe __sync_add_and_fetch like this:

These builtins perform the operation suggested by the name, and return the new value. That is,
{ *ptr op= value; return *ptr; }
{ *ptr = ~(*ptr & value); return *ptr; } // nand
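
From the C# side, this atomic add is what makes Interlocked.Add and Interlocked.Increment safe to call from many threads at once without a lock. A hypothetical sketch of the difference versus a plain increment:

using System.Threading;

public static class CounterDemo
{
    private static long s_unsafeCount;
    private static long s_safeCount;

    public static void Run()
    {
        ThreadStart work = () =>
        {
            for (int i = 0; i < 100000; i++)
            {
                s_unsafeCount++;                        // read-modify-write race: increments can be lost
                Interlocked.Increment(ref s_safeCount); // one atomic add: nothing is ever lost
            }
        };
        var t1 = new Thread(work);
        var t2 = new Thread(work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        // s_safeCount is exactly 200000; s_unsafeCount is very likely less.
    }
}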

Next is Interlocked.Increment:

int64_t Interlocked::Increment64(int64_t* location)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Increment64(location);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return ++(*location);
#endif
}

The same pattern holds here, so let’s just check out the atomic version again:

inline int64_t UnityPalIncrement64(volatile int64_t* value)
{
    ASSERT_ALIGNMENT(value, 8);
    return __sync_add_and_fetch(value, 1);
}

This version uses the same __sync_add_and_fetch, just with a constant value of 1.

Finally, let’s see Interlocked.Read:

int64_t Interlocked::Read(int64_t* location)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Read64(location);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return *location;
#endif
}

And here’s the atomic version:

inline int64_t UnityPalRead64(volatile int64_t* addr)
{
    ASSERT_ALIGNMENT(addr, 8);
    return __sync_fetch_and_add(addr, 0);
}

Instead of __sync_add_and_fetch, this uses __sync_fetch_and_add with a value of 0, so the variable isn’t changed but the 64-bit read still happens atomically. The GCC docs describe the builtin like this:

These builtins perform the operation suggested by the name, and returns the value that had previously been in memory. That is,
{ tmp = *ptr; *ptr op= value; return tmp; }
{ tmp = *ptr; *ptr = ~(tmp & value); return tmp; } // nand
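
The same trick is available from C#. As a hedged sketch, a 64-bit value can be read atomically without a lock by calling Interlocked.CompareExchange with a comparand and value that can never change anything:

using System.Threading;

public static class AtomicRead
{
    public static long Read(ref long location)
    {
        // If the value is 0 we "replace" it with 0, so nothing ever changes,
        // but the compare-and-swap still reads all 64 bits in one atomic step.
        return Interlocked.CompareExchange(ref location, 0L, 0L);
    }
}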

Conclusion

The combination of IL2CPP, Clang (Xcode’s compiler), and atomic operation builtins produces a rather efficient system for multi-threading in C#. lock blocks boil down to just a couple of calls to well-optimized Monitor.Enter and Monitor.Exit functions. volatile field access is only minimally encumbered by full memory barriers where weaker acquire and release fences would do, but that’s more of a nit-pick than a serious performance concern for C# scripts. The [ThreadStatic] attribute simply associates an array with a Thread object and indexes into it whenever you access the field. Finally, the various static functions of the Interlocked class are all hand-written and well optimized for a variety of target platforms. This is one area of Unity scripting where there’s not much need to worry about stepping on any performance-destroying landmines.