How IL2CPP implements lock, volatile, [ThreadStatic], and Interlocked
Writing multi-threaded code is one of the keys to maximizing performance. Currently, this means creating your own threads and synchronizing them with C# keywords like lock and volatile, as well as .NET features like the [ThreadStatic] attribute and the Interlocked class. Today we’ll take a look at how these are implemented behind the scenes by IL2CPP to get some understanding of what we’re really telling the computer to do when we use them.
lock
The lock keyword is used to give a thread exclusive control over some resource. When the block is entered, the thread has exclusive control. When it finishes, control is relinquished. Here’s a trivial example:
public static class TestClass
{
    public static int Lock(object o, int i)
    {
        lock (o)
        {
            return i;
        }
    }
}
Now here’s the C++ code that IL2CPP generates for this function in Unity 2017.4:
extern "C" int32_t TestClass_Lock_m2756077292 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___o0, int32_t ___i1, const RuntimeMethod* method)
{
    RuntimeObject * V_0 = NULL;
    int32_t V_1 = 0;
    Exception_t * __last_unhandled_exception = 0;
    NO_UNUSED_WARNING (__last_unhandled_exception);
    Exception_t * __exception_local = 0;
    NO_UNUSED_WARNING (__exception_local);
    int32_t __leave_target = 0;
    NO_UNUSED_WARNING (__leave_target);
    {
        RuntimeObject * L_0 = ___o0;
        V_0 = L_0;
        RuntimeObject * L_1 = V_0;
        Monitor_Enter_m2249409497(NULL /*static, unused*/, L_1, /*hidden argument*/NULL);
    }

IL_0008:
    try
    { // begin try (depth: 1)
        int32_t L_2 = ___i1;
        V_1 = L_2;
        IL2CPP_LEAVE(0x16, FINALLY_000f);
    } // end try (depth: 1)
    catch(Il2CppExceptionWrapper& e)
    {
        __last_unhandled_exception = (Exception_t *)e.ex;
        goto FINALLY_000f;
    }

FINALLY_000f:
    { // begin finally (depth: 1)
        RuntimeObject * L_3 = V_0;
        Monitor_Exit_m3585316909(NULL /*static, unused*/, L_3, /*hidden argument*/NULL);
        IL2CPP_END_FINALLY(15)
    } // end finally (depth: 1)

    IL2CPP_CLEANUP(15)
    {
        IL2CPP_JUMP_TBL(0x16, IL_0016)
        IL2CPP_RETHROW_IF_UNHANDLED(Exception_t *)
    }

IL_0016:
    {
        int32_t L_4 = V_1;
        return L_4;
    }
}
The vast majority of this is to implement the try-finally that lock implies. This means that if an exception is thrown inside the lock block, the lock on the resource will still be released. In this case there’s no way that return i can actually throw an exception, so all this exception handling is pointless. It’s worth noting that this try-finally does not incur the method initialization overhead we normally get when using exceptions.
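To make that concrete, here’s roughly what the compiler’s expansion of lock looks like in plain C#. This is a sketch of the pattern visible in the generated C++ above, not the exact code the compiler emits (newer C# compilers actually use the Monitor.Enter(object, ref bool lockTaken) overload):

using System.Threading;

public static class LockExpansionSketch
{
    // A sketch of what "lock (o) { return i; }" expands to:
    // Monitor.Enter, the body inside try, and Monitor.Exit in finally.
    public static int Lock(object o, int i)
    {
        object lockObj = o;
        Monitor.Enter(lockObj);
        try
        {
            return i;
        }
        finally
        {
            Monitor.Exit(lockObj);
        }
    }
}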
Let’s see if the C++ compiler was able to figure out that all of this exception-related code isn’t necessary by looking at how Xcode 9.2 compiles the C++ to ARM assembly:
push {r4, r5, r7, lr}
add  r7, sp, #8
mov  r4, r2
movs r0, #0
movs r2, #0
mov  r5, r1
bl   _Monitor_Enter_m2249409497
movs r0, #0
mov  r1, r5
movs r2, #0
bl   _Monitor_Exit_m3585316909
mov  r0, r4
pop  {r4, r5, r7, pc}
All of that messy C++ has boiled down to basically just a couple of calls to Monitor.Enter and Monitor.Exit. These are the calls that perform the actual locking and unlocking, meaning that lock is really just syntax sugar for these calls plus the try-finally. So let’s look at how they’re implemented:
extern "C" void Monitor_Enter_m2249409497 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
    typedef void (*Monitor_Enter_m2249409497_ftn) (RuntimeObject *);
    using namespace il2cpp::icalls;
    ((Monitor_Enter_m2249409497_ftn)mscorlib::System::Threading::Monitor::Enter) (___obj0);
}

extern "C" void Monitor_Exit_m3585316909 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
    typedef void (*Monitor_Exit_m3585316909_ftn) (RuntimeObject *);
    using namespace il2cpp::icalls;
    ((Monitor_Exit_m3585316909_ftn)mscorlib::System::Threading::Monitor::Exit) (___obj0);
}
These are essentially just stubs that call into the IL2CPP library. Its source code ships in the Unity installation directory under the name libil2cpp. Opening up icalls/mscorlib/System.Threading/Monitor.cpp we see this:
void Monitor::Enter(Il2CppObject* obj)
{
    IL2CPP_CHECK_ARG_NULL(obj);
    il2cpp::vm::Monitor::Enter(obj);
}

void Monitor::Exit(Il2CppObject* obj)
{
    IL2CPP_CHECK_ARG_NULL(obj);
    il2cpp::vm::Monitor::Exit(obj);
}
These are basically just more wrapper functions, so we need to open up libil2cpp/vm/Monitor.cpp to see how they really work:
void Monitor::Enter(Il2CppObject* object)
{
    TryEnter(object, std::numeric_limits<uint32_t>::max());
}

bool Monitor::TryEnter(Il2CppObject* obj, uint32_t timeOutMilliseconds)
{
    // NOTE: contents removed
}

void Monitor::Exit(Il2CppObject* obj)
{
    // NOTE: contents removed
}
These functions are several hundred lines long, but they’re really well commented, so feel free to give them a read to better understand what’s going on. In short, Enter and TryEnter implement Monitor.Enter via a lot of atomic operations. The same goes for Exit and Monitor.Exit.
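Since lock is just sugar over these Monitor calls, you can also call Monitor directly when you need something the lock keyword can’t express, such as giving up after a timeout. A hedged sketch (the Guarded class and its names are made up for illustration):

using System;
using System.Threading;

public static class Guarded
{
    // Calling Monitor directly instead of using lock allows a bounded wait.
    // Returns false if the lock couldn't be acquired within timeoutMs.
    public static bool TryDoWork(object gate, Action work, int timeoutMs)
    {
        if (!Monitor.TryEnter(gate, timeoutMs))
        {
            return false;
        }
        try
        {
            work();
            return true;
        }
        finally
        {
            Monitor.Exit(gate);
        }
    }
}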
volatile
The C# volatile keyword is documented as indicating that a variable may be accessed from multiple threads, so certain optimizations should be disabled and the variable’s value should always be up to date. What does this mean in practice with IL2CPP? Let’s write a pair of test functions to find out:
public struct MyStruct
{
    public volatile int VolatileField;
}

public static class TestClass
{
    public static int ReadVolatileField(MyStruct x)
    {
        return x.VolatileField;
    }

    public static void WriteVolatileField(MyStruct x)
    {
        x.VolatileField = 123;
    }
}
Here’s the C++ that IL2CPP generates:
extern "C" int32_t TestClass_ReadVolatileField_m4039530565 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int32_t L_0 = (&___x0)->get_VolatileField_0();
        il2cpp_codegen_memory_barrier();
        return L_0;
    }
}

extern "C" void TestClass_WriteVolatileField_m741190780 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        il2cpp_codegen_memory_barrier();
        (&___x0)->set_VolatileField_0(((int32_t)123));
        return;
    }
}
When reading, the volatile field is read and then il2cpp_codegen_memory_barrier is called. Writing is the opposite: il2cpp_codegen_memory_barrier is called and then the value is written. Let’s dive in to see how the memory barrier is implemented:
inline void il2cpp_codegen_memory_barrier()
{
    il2cpp::vm::Thread::FullMemoryBarrier();
}
This is another wrapper function, so let’s go into the IL2CPP source code like above to find Thread.cpp:
void Thread::FullMemoryBarrier()
{
    os::Atomic::FullMemoryBarrier();
}
This wrapper function calls into another part of IL2CPP, so let’s go to Atomic.h:
#if !IL2CPP_SUPPORT_THREADS
namespace il2cpp
{
namespace os
{
    inline void Atomic::FullMemoryBarrier()
    {
        // Do nothing.
    }
}
}
#elif IL2CPP_TARGET_WINDOWS
#include "os/Win32/AtomicImpl.h"
#elif IL2CPP_TARGET_PS4
#include "os/AtomicImpl.h" // has to come earlier than posix
#elif IL2CPP_TARGET_PSP2
#include "os/PSP2/AtomicImpl.h"
#elif IL2CPP_TARGET_POSIX
#include "os/Posix/AtomicImpl.h"
#else
#include "os/AtomicImpl.h"
#endif
Here we see that the implementation of FullMemoryBarrier is platform-specific. Since we’re on POSIX-based iOS, we’ll read os/Posix/AtomicImpl.h:
inline void Atomic::FullMemoryBarrier()
{
    __sync_synchronize();
}
Finally we’ve reached the implementation! The GCC docs describe __sync_synchronize as simply “a full memory barrier”, which means that all reads and writes before the barrier are committed to memory before any reads and writes after it. This is overkill, since a volatile read only needs acquire semantics and a volatile write only needs release semantics, but the difference likely doesn’t matter to any C# script as only extremely performance-critical code needs this level of optimization.
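For reference, here’s the kind of C# code volatile is typically used for: a flag written by one thread and polled by another. This is just an illustrative sketch (the Worker class is made up), not code from the tests above:

using System.Threading;

public class Worker
{
    // volatile ensures the loop below re-reads the flag on every iteration
    // and observes a write made by another thread.
    private volatile bool stopRequested;

    public void Stop()
    {
        stopRequested = true;
    }

    public void Run()
    {
        while (!stopRequested)
        {
            // ... do one unit of work ...
            Thread.Sleep(1);
        }
    }
}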
[ThreadStatic]
The [ThreadStatic] attribute is applied to static variables that need to be scoped to a single thread rather than shared by all threads in the process. Let’s write a similar pair of test functions to see how it’s implemented:
public struct MyStruct
{
    public volatile int VolatileField;

    [ThreadStatic] public static int ThreadStaticField;
}

public static class TestClass
{
    public static int ReadThreadStaticField()
    {
        return MyStruct.ThreadStaticField;
    }

    public static void WriteThreadStaticField()
    {
        MyStruct.ThreadStaticField = 123;
    }
}
Here’s the C++ that IL2CPP turns this into:
extern "C" int32_t TestClass_ReadThreadStaticField_m1391367394 (RuntimeObject * __this /* static, unused */, const RuntimeMethod* method)
{
    static bool s_Il2CppMethodInitialized;
    if (!s_Il2CppMethodInitialized)
    {
        il2cpp_codegen_initialize_method (TestClass_ReadThreadStaticField_m1391367394_MetadataUsageId);
        s_Il2CppMethodInitialized = true;
    }
    {
        int32_t L_0 = ((MyStruct_t123831593_ThreadStaticFields*)il2cpp_codegen_get_thread_static_data(MyStruct_t123831593_il2cpp_TypeInfo_var))->get_ThreadStaticField_2();
        return L_0;
    }
}

extern "C" void TestClass_WriteThreadStaticField_m959076367 (RuntimeObject * __this /* static, unused */, const RuntimeMethod* method)
{
    static bool s_Il2CppMethodInitialized;
    if (!s_Il2CppMethodInitialized)
    {
        il2cpp_codegen_initialize_method (TestClass_WriteThreadStaticField_m959076367_MetadataUsageId);
        s_Il2CppMethodInitialized = true;
    }
    {
        ((MyStruct_t123831593_ThreadStaticFields*)il2cpp_codegen_get_thread_static_data(MyStruct_t123831593_il2cpp_TypeInfo_var))->set_ThreadStaticField_2(((int32_t)123));
        return;
    }
}
Each of these gets method initialization overhead, as is the case with all other functions that access static fields. Beyond that, there’s a call to il2cpp_codegen_get_thread_static_data followed by get_ThreadStaticField_2 for reading or set_ThreadStaticField_2 for writing. Let’s look at il2cpp_codegen_get_thread_static_data to start:
inline void* il2cpp_codegen_get_thread_static_data(RuntimeClass* klass)
{
    return il2cpp::vm::Thread::GetThreadStaticData(klass->thread_static_fields_offset);
}
This wrapper function calls into libil2cpp, so let’s go there and see how it’s implemented:
void* Thread::GetThreadStaticData(int32_t offset)
{
    // No lock. We allocate static_data once with a fixed size so we can read it
    // safely without a lock here.
    IL2CPP_ASSERT(offset >= 0 && static_cast<uint32_t>(offset) < s_ThreadStaticSizes.size());
    return Current()->GetInternalThread()->static_data[offset];
}
The data is simply kept in an array associated with the Thread object. Let’s look at each part of this, starting with Current:
Il2CppThread* Thread::Current()
{
    void* value = NULL;
    s_CurrentThread.GetValue(&value);
    return (Il2CppThread*)value;
}
The current thread is just a static variable: s_CurrentThread. Next there’s a call to GetInternalThread:
#if !NET_4_0
Il2CppThread* GetInternalThread()
{
    return this;
}
#else
Il2CppInternalThread* GetInternalThread() const
{
    return internal_thread;
}
#endif
This differs depending on the choice of .NET version. .NET 4 defers to an Il2CppInternalThread, but both have a static_data field.
Jumping back to il2cpp_codegen_get_thread_static_data, we see that its return value is cast to a MyStruct_t123831593_ThreadStaticFields pointer, and that struct holds all the [ThreadStatic] fields:
struct MyStruct_t123831593_ThreadStaticFields
{
public:
    // System.Int32 MyStruct::ThreadStaticField
    int32_t ___ThreadStaticField_2;

public:
    inline static int32_t get_offset_of_ThreadStaticField_2() { return static_cast<int32_t>(offsetof(MyStruct_t123831593_ThreadStaticFields, ___ThreadStaticField_2)); }
    inline int32_t get_ThreadStaticField_2() const { return ___ThreadStaticField_2; }
    inline int32_t* get_address_of_ThreadStaticField_2() { return &___ThreadStaticField_2; }
    inline void set_ThreadStaticField_2(int32_t value)
    {
        ___ThreadStaticField_2 = value;
    }
};
In this case there’s just one field and a bunch of accessors for it. So when get_ThreadStaticField_2 or set_ThreadStaticField_2 are called, they’re just trivially getting or setting the field.
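To see what that per-thread storage means in practice, here’s a small illustrative sketch (the class and method names are made up): each thread reads and writes its own copy of the [ThreadStatic] field, so the writes on the worker threads never affect the main thread’s value:

using System;
using System.Threading;

public static class ThreadStaticExample
{
    [ThreadStatic] private static int threadLocalValue;

    public static void Run()
    {
        var a = new Thread(() => { threadLocalValue = 1; Console.WriteLine("A: " + threadLocalValue); });
        var b = new Thread(() => { threadLocalValue = 2; Console.WriteLine("B: " + threadLocalValue); });
        a.Start();
        b.Start();
        a.Join();
        b.Join();

        // The main thread's copy was never written, so it still prints 0.
        Console.WriteLine("Main: " + threadLocalValue);
    }
}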
Interlocked
The Interlocked class in .NET has a lot of static functions to help synchronize multi-threaded code. These include a variety of atomic operations that serve as alternatives to volatile and lock. Let’s try some out:
public struct MyStruct
{
    public volatile int VolatileField;

    public long NonVolatileField;

    [ThreadStatic] public static int ThreadStaticField;
}

public static class TestClass
{
    public static void InterlockedCompareExchange(MyStruct x)
    {
        Interlocked.CompareExchange(ref x.NonVolatileField, 123, 456);
    }

    public static void InterlockedExchange(MyStruct x)
    {
        Interlocked.Exchange(ref x.NonVolatileField, 123);
    }

    public static void InterlockedAdd(MyStruct x)
    {
        Interlocked.Add(ref x.NonVolatileField, 123);
    }

    public static void InterlockedIncrement(MyStruct x)
    {
        Interlocked.Increment(ref x.NonVolatileField);
    }

    public static long InterlockedRead(MyStruct x)
    {
        return Interlocked.Read(ref x.NonVolatileField);
    }
}
Let’s see the IL2CPP output for these:
extern "C" void TestClass_InterlockedCompareExchange_m2663092245 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
        Interlocked_CompareExchange_m1385746522(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), (((int64_t)((int64_t)((int32_t)456)))), /*hidden argument*/NULL);
        return;
    }
}

extern "C" void TestClass_InterlockedExchange_m3541638669 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
        Interlocked_Exchange_m3049791109(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), /*hidden argument*/NULL);
        return;
    }
}

extern "C" void TestClass_InterlockedAdd_m51470974 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
        Interlocked_Add_m802687858(NULL /*static, unused*/, L_0, (((int64_t)((int64_t)((int32_t)123)))), /*hidden argument*/NULL);
        return;
    }
}

extern "C" void TestClass_InterlockedIncrement_m2224055823 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
        Interlocked_Increment_m1565533900(NULL /*static, unused*/, L_0, /*hidden argument*/NULL);
        return;
    }
}

extern "C" int64_t TestClass_InterlockedRead_m3203030973 (RuntimeObject * __this /* static, unused */, MyStruct_t123831593 ___x0, const RuntimeMethod* method)
{
    {
        int64_t* L_0 = (&___x0)->get_address_of_NonVolatileField_1();
        int64_t L_1 = Interlocked_Read_m673992094(NULL /*static, unused*/, L_0, /*hidden argument*/NULL);
        return L_1;
    }
}
All of these are just calls into their respective Interlocked functions, so let’s look at those:
extern "C" int64_t Interlocked_CompareExchange_m1385746522 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, int64_t ___comparand2, const RuntimeMethod* method)
{
    typedef int64_t (*Interlocked_CompareExchange_m1385746522_ftn) (int64_t*, int64_t, int64_t);
    using namespace il2cpp::icalls;
    return ((Interlocked_CompareExchange_m1385746522_ftn)mscorlib::System::Threading::Interlocked::CompareExchange64) (___location10, ___value1, ___comparand2);
}

extern "C" int64_t Interlocked_Exchange_m3049791109 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, const RuntimeMethod* method)
{
    typedef int64_t (*Interlocked_Exchange_m3049791109_ftn) (int64_t*, int64_t);
    using namespace il2cpp::icalls;
    return ((Interlocked_Exchange_m3049791109_ftn)mscorlib::System::Threading::Interlocked::Exchange64) (___location10, ___value1);
}

extern "C" int64_t Interlocked_Add_m802687858 (RuntimeObject * __this /* static, unused */, int64_t* ___location10, int64_t ___value1, const RuntimeMethod* method)
{
    typedef int64_t (*Interlocked_Add_m802687858_ftn) (int64_t*, int64_t);
    using namespace il2cpp::icalls;
    return ((Interlocked_Add_m802687858_ftn)mscorlib::System::Threading::Interlocked::Add64) (___location10, ___value1);
}

extern "C" int64_t Interlocked_Increment_m1565533900 (RuntimeObject * __this /* static, unused */, int64_t* ___location0, const RuntimeMethod* method)
{
    typedef int64_t (*Interlocked_Increment_m1565533900_ftn) (int64_t*);
    using namespace il2cpp::icalls;
    return ((Interlocked_Increment_m1565533900_ftn)mscorlib::System::Threading::Interlocked::Increment64) (___location0);
}

extern "C" int64_t Interlocked_Read_m673992094 (RuntimeObject * __this /* static, unused */, int64_t* ___location0, const RuntimeMethod* method)
{
    typedef int64_t (*Interlocked_Read_m673992094_ftn) (int64_t*);
    using namespace il2cpp::icalls;
    return ((Interlocked_Read_m673992094_ftn)mscorlib::System::Threading::Interlocked::Read) (___location0);
}
Just like above, these are wrapper functions that call into libil2cpp. Let’s go through Interlocked.cpp starting with CompareExchange:
int64_t Interlocked::CompareExchange64(int64_t* location, int64_t value, int64_t comparand)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::CompareExchange64(location, value, comparand);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    int64_t orig = *location;
    if (*location == comparand)
        *location = value;
    return orig;
#endif
}
Here we see that the function is implemented two different ways depending on whether IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT is set. What determines that? To find out, let’s look at il2cpp-config.h:
// 64-bit types are aligned to 8 bytes on 64-bit platforms and always on Windows
#define IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT ((IL2CPP_SIZEOF_VOID_P == 8) || (IL2CPP_TARGET_WINDOWS))
So IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT is set on 64-bit platforms and on Windows. This means that CompareExchange64 is implemented with just a call to Atomic::CompareExchange64 on those platforms, which are presumably the most common nowadays. Let’s look at Atomic.h to see Atomic::CompareExchange64:
static inline int64_t CompareExchange64(volatile int64_t* dest, int64_t exchange, int64_t comparand)
{
    return UnityPalCompareExchange64(dest, exchange, comparand);
}
This is yet another wrapper function, so we need to find UnityPalCompareExchange64 in Atomic-c-api.h:
#elif IL2CPP_TARGET_WINDOWS
#include "Win32/AtomicImpl-c-api.h"
#elif IL2CPP_TARGET_PS4
#include "PS4/AtomicImpl-c-api.h" // has to come earlier than posix
#elif IL2CPP_TARGET_PSP2
#include "PSP2/AtomicImpl-c-api.h"
#elif IL2CPP_TARGET_POSIX
#include "Posix/AtomicImpl-c-api.h"
#else
#include "AtomicImpl-c-api.h"
#endif
This is another area of platform-specific code, so we’ll open Posix/AtomicImpl-c-api.h to find the iOS version:
inline int64_t UnityPalCompareExchange64(volatile int64_t* dest, int64_t exchange, int64_t comparand)
{
    ASSERT_ALIGNMENT(dest, 8);

#ifdef __EMSCRIPTEN__
    return emscripten_atomic_cas_u64((void*)dest, comparand, exchange) == comparand ? comparand : *dest;
#else
    return __sync_val_compare_and_swap(dest, comparand, exchange);
#endif
}
Finally, we’ve found the implementation! There’s an Emscripten-specific version for WebGL builds but for iOS we’re using another builtin from GCC described like this:
These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr.
The “bool” version returns true if the comparison is successful and newval was written. The “val” version returns the contents of *ptr before the operation.
This means that Interlocked.CompareExchange calls a series of wrapper functions that result in a single atomic compare and swap operation on 64-bit and Windows platforms. Otherwise, the atomic operation is replaced with a FastAutoLock and an if branch. To find out how FastAutoLock works, we go to Mutex.h:
class FastMutexImpl;

class FastMutex
{
public:
    FastMutex();
    ~FastMutex();

    void Lock();
    void Unlock();

    FastMutexImpl* GetImpl();

private:
    FastMutexImpl* m_Impl;
};

struct FastAutoLock : public il2cpp::utils::NonCopyable
{
    FastAutoLock(FastMutex* mutex)
        : m_Mutex(mutex)
    {
        m_Mutex->Lock();
    }

    ~FastAutoLock()
    {
        m_Mutex->Unlock();
    }

private:
    FastMutex* m_Mutex;
};
So FastAutoLock really just wraps a FastMutex, which in turn wraps a FastMutexImpl, which is another platform-specific part of the code. Let’s look at the POSIX version that’s used on iOS:
class FastMutexImpl
{
public:
    FastMutexImpl()
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&m_Mutex, &attr);
        pthread_mutexattr_destroy(&attr);
    }

    ~FastMutexImpl()
    {
        pthread_mutex_destroy(&m_Mutex);
    }

    void Lock()
    {
        pthread_mutex_lock(&m_Mutex);
    }

    void Unlock()
    {
        pthread_mutex_unlock(&m_Mutex);
    }

    pthread_mutex_t* GetOSHandle()
    {
        return &m_Mutex;
    }

private:
    pthread_mutex_t m_Mutex;
};
The POSIX version is, unsurprisingly, just a wrapper around calls into the POSIX Threads (a.k.a. pthreads) library. This means that FastAutoLock acquires a full mutex lock in order to implement Interlocked.CompareExchange, at least for the overload taking long. The int version always uses an atomic operation instead of a mutex:
int32_t Interlocked::CompareExchange(int32_t* location, int32_t value, int32_t comparand)
{
    return Atomic::CompareExchange(location, value, comparand);
}
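Regardless of which path the runtime takes internally, the typical way to use CompareExchange from C# is as an atomic “test and claim.” Here’s a hedged sketch (the OneTimeInit class is made up for illustration):

using System;
using System.Threading;

public static class OneTimeInit
{
    private static int initialized; // 0 = not yet run, 1 = already run

    // Atomically flips the flag from 0 to 1. Only the thread that observes the
    // old value 0 runs the initializer; every other caller sees 1 and skips it.
    public static void EnsureInitialized(Action initialize)
    {
        if (Interlocked.CompareExchange(ref initialized, 1, 0) == 0)
        {
            initialize();
        }
    }
}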
Next let’s look at Interlocked.Exchange:
int64_t Interlocked::Exchange64(int64_t* location1, int64_t value)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Exchange64(location1, value);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    int64_t orig = *location1;
    *location1 = value;
    return orig;
#endif
}
This is the same pattern as before. The locking version is clear, so let’s look at what the Atomic::Exchange64 version ends up calling:
inline int64_t UnityPalExchange64(volatile int64_t* dest, int64_t exchange)
{
    ASSERT_ALIGNMENT(dest, 8);

#ifdef __EMSCRIPTEN__
    return emscripten_atomic_exchange_u64((void*)dest, exchange);
#else
    int64_t prev;
    do
    {
        prev = *dest;
    } while (!__sync_bool_compare_and_swap(dest, prev, exchange));
    return prev;
#endif
}
This is similar to UnityPalCompareExchange64, except that it must loop until the “compare and swap” operation succeeds.
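That “loop until the compare and swap succeeds” pattern is also how richer atomic operations are usually built on top of Interlocked.CompareExchange in C#. An illustrative sketch (the AtomicMath class is made up):

using System.Threading;

public static class AtomicMath
{
    // Atomically raises 'target' to 'candidate' if 'candidate' is larger, looping
    // until the compare-and-swap succeeds, much like UnityPalExchange64 above.
    public static long InterlockedMax(ref long target, long candidate)
    {
        while (true)
        {
            long observed = Interlocked.Read(ref target);
            if (candidate <= observed)
            {
                return observed;
            }
            if (Interlocked.CompareExchange(ref target, candidate, observed) == observed)
            {
                return candidate;
            }
        }
    }
}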
Now let’s look at Interlocked.Add:
int64_t Interlocked::Add64(int64_t* location1, int64_t value)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Add64(location1, value);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return *location1 += value;
#endif
}
Again we see the same pattern, so let’s just look at the atomic version:
inline int64_t UnityPalAdd64(volatile int64_t* location1, int64_t value)
{
    ASSERT_ALIGNMENT(location1, 8);
    return __sync_add_and_fetch(location1, value);
}
The GCC docs describe __sync_add_and_fetch like this:
These builtins perform the operation suggested by the name, and return the new value. That is,
{ *ptr op= value; return *ptr; }
{ *ptr = ~(*ptr & value); return *ptr; } // nand
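In C# terms, that “return the new value” behavior is what Interlocked.Add gives you. A small illustrative sketch (the Totals class is made up):

using System.Threading;

public static class Totals
{
    private static long bytesProcessed;

    // Interlocked.Add returns the new running total, mirroring the
    // "return the new value" behavior of __sync_add_and_fetch.
    public static long AddBytes(long count)
    {
        return Interlocked.Add(ref bytesProcessed, count);
    }
}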
Next is Interlocked.Increment:
int64_t Interlocked::Increment64(int64_t* location)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Increment64(location);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return ++(*location);
#endif
}
The same pattern holds here, so let’s just check out the atomic version again:
inline int64_t UnityPalIncrement64(volatile int64_t* value)
{
    ASSERT_ALIGNMENT(value, 8);
    return __sync_add_and_fetch(value, 1);
}
This version uses the same __sync_add_and_fetch, except with a constant value of 1.
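That matches Interlocked.Increment in C#, which is effectively Interlocked.Add with a value of 1 and, like __sync_add_and_fetch, returns the new value. That makes it handy for handing out unique IDs across threads; a small made-up sketch:

using System.Threading;

public static class IdGenerator
{
    private static long lastId;

    // Every caller gets a distinct value because Increment returns the
    // post-increment value, just like __sync_add_and_fetch(value, 1).
    public static long NextId()
    {
        return Interlocked.Increment(ref lastId);
    }
}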
Finally, let’s see Interlocked.Read:
int64_t Interlocked::Read(int64_t* location)
{
#if IL2CPP_ENABLE_INTERLOCKED_64_REQUIRED_ALIGNMENT
    return Atomic::Read64(location);
#else
    FastAutoLock lock(&m_Atomic64Mutex);
    return *location;
#endif
}
And here’s the atomic version:
inline int64_t UnityPalRead64(volatile int64_t* addr)
{
    ASSERT_ALIGNMENT(addr, 8);
    return __sync_fetch_and_add(addr, 0);
}
Instead of __sync_add_and_fetch, this uses __sync_fetch_and_add but adds the value 0, so there’s no effect on the variable. The GCC docs describe the builtin like this:
These builtins perform the operation suggested by the name, and return the value that had previously been in memory. That is,
{ tmp = *ptr; *ptr op= value; return tmp; }
{ tmp = *ptr; *ptr = ~(tmp & value); return tmp; } // nand
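The practical reason Interlocked.Read exists is that a plain read of a long isn’t guaranteed to be atomic on 32-bit platforms, so it can observe a half-written value. A hedged sketch of the usual pattern (the Stats class is made up):

using System.Threading;

public static class Stats
{
    private static long totalFrames;

    public static void OnFrame()
    {
        Interlocked.Increment(ref totalFrames);
    }

    // On 32-bit platforms a plain read of a long can tear, observing half of an
    // in-progress write, so Interlocked.Read is used to get a consistent snapshot.
    public static long GetTotalFrames()
    {
        return Interlocked.Read(ref totalFrames);
    }
}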
Conclusion
The combination of IL2CPP, Clang (Xcode’s compiler), and atomic operation builtins has produced a rather efficient system to support multi-threading in C#. lock blocks boil down to just a couple of calls to well-optimized Monitor.Enter and Monitor.Exit functions. volatile variable access is only minimally encumbered: it uses full memory barriers where just a read (acquire) or write (release) barrier would do, but this is more of a nit-pick than a serious performance concern for C# scripts. The [ThreadStatic] attribute simply associates an array with a Thread object and indexes into it when you access the variable. Finally, the various static functions of the Interlocked class are all hand-written and well optimized for a variety of target platforms. This is one area of Unity scripting where there’s not much need to worry about stepping on any performance-destroying landmines.
#1 by Guavaman on April 29th, 2020
I know this article is a couple of years old, but I ran into one such performance-destroying landmine in IL2CPP on Windows Standalone, Unity 2018.4.20f1. Lock, or more specifically, Monitor.Exit. In Mono, there is essentially no overhead at all in using a lock statement if only one thread is accessing that object. However, in IL2CPP, there is significant overhead. I had an application that was making probably hundreds of calls per frame to a thread-safe class that used locking for synchronization. This application ran at 100 fps in Mono and 43 fps in IL2CPP. When I removed the use of lock, IL2CPP suddenly ran at 156 fps — a 3.6x performance boost simply from removing lock statements. This is a land mine if I’ve ever seen one.
#2 by Guavaman on April 29th, 2020
Interlocked.Exchange and SpinWait are MUCH faster in IL2CPP with barely any performance overhead when no other thread is trying to access the object.
#3 by jackson on May 4th, 2020
Thanks for posting these comments! They inspired today’s article.
#4 by Eoin O'Grady on September 3rd, 2020
Hey Jack,
Just wanted to get your opinion on using a lock statement vs Interlocked Compare and Exchange which one will have the least cpu overhead? It reads like the Interlocked functions will
#5 by jackson on September 5th, 2020
Interlocked is faster than locks, but serves a somewhat different purpose. Make sure you’re using the one appropriate for your task.