This week we’ll take a break from the C++ Scripting series to explore three optimizations we can make to our C# code so that IL2CPP generates faster C++ code for us. We’ll cover three areas that yield big speedups: casting, array bounds checking, and null checking.

Let’s start today by looking at what IL2CPP generates for the two types of C# casts: normal ((MyClass)x) and the as operator (x as MyClass). Here’s a tiny amount of C# code that uses both kinds of casts.

class MyClass
{
}
 
static class TestClass
{
	static MyClass NormalCast(object obj)
	{
		return (MyClass)obj;
	}
 
	static MyClass AsCast(object obj)
	{
		return obj as MyClass;
	}
}

Now let’s make a build with Unity 2017.3.0f3’s IL2CPP for the iOS target using Release mode and “Script Call Optimization” set to “Fast but no Exceptions.” When we open the Xcode project and search for “TestClass::NormalCast” we find the generated C++ function:

// MyClass TestClass::NormalCast(System.Object)
extern "C"  MyClass_t3388352440 * TestClass_NormalCast_m842078812 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_NormalCast_m842078812_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		RuntimeObject * L_0 = ___obj0;
		return ((MyClass_t3388352440 *)CastclassClass((RuntimeObject*)L_0, MyClass_t3388352440_il2cpp_TypeInfo_var));
	}
}

There are several interesting aspects of this function. First, it’s using a static bool to keep track of whether the function has been initialized. This is essentially a global variable and therefore very likely to not currently be in the CPU’s caches (L1, L2, L3) unless this function was called very recently. The resulting cache miss will likely cost about 100 nanoseconds, enough time for typical CPUs to perform dozens or even hundreds of math instructions. It’s also using the MyClass_t3388352440_il2cpp_TypeInfo_var global variable, which is also likely not in the CPU’s cache when this function is called. That’s dozens or hundreds of instructions that also might be wasted. Of course the first call to this function will also result in a call to il2cpp_codegen_initialize_method which will surely take even more time. So there’s quite a lot of overhead to this function every time it’s called.
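
To make that one-time initialization pattern concrete, here’s a minimal C++ sketch of the same guard. InitializeMethodMetadata and GuardedAddOne are hypothetical names made up for this illustration. They’re stand-ins, not IL2CPP functions, and the real initialization does far more work than bumping a counter.

```cpp
// Hypothetical stand-in for il2cpp_codegen_initialize_method. It only counts
// how many times the (normally expensive) one-time setup runs.
static int g_initializeCalls = 0;

static void InitializeMethodMetadata()
{
    ++g_initializeCalls;
}

// Same shape as the generated function: the function-local static flag is
// read on every call, but the setup itself only runs on the first call.
int GuardedAddOne(int value)
{
    static bool s_methodInitialized = false;
    if (!s_methodInitialized)
    {
        InitializeMethodMetadata();
        s_methodInitialized = true;
    }
    return value + 1;
}
```

Calling GuardedAddOne several times runs InitializeMethodMetadata only once, but every call still has to read s_methodInitialized from memory. That read is the per-call cost described above.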

Second, the actual cast is done by CastclassClass. Let’s “jump to definition” and see what that looks like:

inline RuntimeObject* CastclassClass(RuntimeObject *obj, RuntimeClass* targetType)
{
    if (!obj)
        return NULL;
 
    RuntimeObject* result = IsInstClass(obj, targetType);
    if (result)
        return result;
 
    RaiseInvalidCastException(obj, targetType);
    return NULL;
}

This does at least one if check and probably two if the object isn’t null. It also calls IsInstClass, so let’s go see what that looks like:

inline RuntimeObject* IsInstClass(RuntimeObject *obj, RuntimeClass* targetType)
{
#if IL2CPP_DEBUG
    IL2CPP_ASSERT((targetType->flags & TYPE_ATTRIBUTE_INTERFACE) == 0);
#endif
    if (!obj)
        return NULL;
 
    // optimized version to compare classes
    return il2cpp::vm::Class::HasParentUnsafe(obj->klass, targetType) ? obj : NULL;
}

Assuming IL2CPP_DEBUG isn’t defined for our release build, this function does a redundant if check for a null object and then calls il2cpp::vm::Class::HasParentUnsafe with the klass field of RuntimeObject. RuntimeObject is an alias for Il2CppObject and klass is its very first field, so we’ll be fetching a CPU cache line (e.g. 32 or 64 bytes) starting right at the beginning of the object. It’s good to keep that in mind for future memory reads so we can decide if they’re likely to be a cache “hit” or “miss.”

Now let’s look at HasParentUnsafe:

static bool HasParentUnsafe(const Il2CppClass* klass, const Il2CppClass* parent)
{
	return klass->typeHierarchyDepth >= parent->typeHierarchyDepth
		&& klass->typeHierarchy[parent->typeHierarchyDepth - 1] == parent;
}

Here we have two pointers to either the same Il2CppClass or different ones. If they’re the same, we’ll be accessing the same memory multiple times and therefore have a cache “miss” on the first access and a cache “hit” on the second since we just read it. If they’re different, we’ll have two cache “misses.” Let’s see what Il2CppClass looks like so we know where in its list of fields we’re accessing memory:

struct Il2CppClass
{
	// The following fields are always valid for a Il2CppClass structure
	const Il2CppImage* image;
	void* gc_desc;
	const char* name;
	const char* namespaze;
	const Il2CppType* byval_arg;
	const Il2CppType* this_arg;
	Il2CppClass* element_class;
	Il2CppClass* castClass;
	Il2CppClass* declaringType;
	Il2CppClass* parent;
	Il2CppGenericClass *generic_class;
	const Il2CppTypeDefinition* typeDefinition; // non-NULL for Il2CppClass's constructed from type defintions
	const Il2CppInteropData* interopData;
	// End always valid fields
 
	// The following fields need initialized before access. This can be done per field or as an aggregate via a call to Class::Init
	FieldInfo* fields; // Initialized in SetupFields
	const EventInfo* events; // Initialized in SetupEvents
	const PropertyInfo* properties; // Initialized in SetupProperties
	const MethodInfo** methods; // Initialized in SetupMethods
	Il2CppClass** nestedTypes; // Initialized in SetupNestedTypes
	Il2CppClass** implementedInterfaces; // Initialized in SetupInterfaces
	Il2CppRuntimeInterfaceOffsetPair* interfaceOffsets; // Initialized in Init
	void* static_fields; // Initialized in Init
	const Il2CppRGCTXData* rgctx_data; // Initialized in Init
	// used for fast parent checks
	Il2CppClass** typeHierarchy; // Initialized in SetupTypeHierachy
	// End initialization required fields
 
	uint32_t cctor_started;
	uint32_t cctor_finished;
	ALIGN_TYPE(8) uint64_t cctor_thread;
 
	// Remaining fields are always valid except where noted
	GenericContainerIndex genericContainerIndex;
	CustomAttributeIndex customAttributeIndex;
	uint32_t instance_size;
	uint32_t actualSize;
	uint32_t element_size;
	int32_t native_size;
	uint32_t static_fields_size;
	uint32_t thread_static_fields_size;
	int32_t thread_static_fields_offset;
	uint32_t flags;
	uint32_t token;
 
	uint16_t method_count; // lazily calculated for arrays, i.e. when rank > 0
	uint16_t property_count;
	uint16_t field_count;
	uint16_t event_count;
	uint16_t nested_type_count;
	uint16_t vtable_count; // lazily calculated for arrays, i.e. when rank > 0
	uint16_t interfaces_count;
	uint16_t interface_offsets_count; // lazily calculated for arrays, i.e. when rank > 0
 
	uint8_t typeHierarchyDepth; // Initialized in SetupTypeHierachy
	uint8_t genericRecursionDepth;
	uint8_t rank;
	uint8_t minimumAlignment;
	uint8_t packingSize;
 
	uint8_t valuetype : 1;
	uint8_t initialized : 1;
	uint8_t enumtype : 1;
	uint8_t is_generic : 1;
	uint8_t has_references : 1;
	uint8_t init_pending : 1;
	uint8_t size_inited : 1;
	uint8_t has_finalize : 1;
	uint8_t has_cctor : 1;
	uint8_t is_blittable : 1;
	uint8_t is_import_or_windows_runtime : 1;
	uint8_t is_vtable_initialized : 1;
	VirtualInvokeData vtable[IL2CPP_ZERO_LEN_ARRAY];
};

This is a really big class! The first access is to typeHierarchyDepth which we can see is quite a ways down in the list of fields. Regardless, it was likely to be a cache “miss” anyhow so it doesn’t really matter. Unfortunately, the second access is to typeHierarchy which is before typeHierarchyDepth. Since the CPU cache is filled in “lines” (blocks of about 32 or 64 bytes), we need to count the distance in bytes between these two fields. Going by the listing above, there are 13 four-byte fields (counting the two index typedefs as 32-bit), one eight-byte field, and eight two-byte fields in between, for a total of 76 bytes before any alignment padding. Adding in the size of the typeHierarchy pointer as either 4 (32-bit systems) or 8 (64-bit systems) plus the one byte for typeHierarchyDepth, we’d need to fit at least 81 or 85 bytes in the cache line. Since that’s bigger than the x86 size (64 bytes) or the ARM size (32 bytes), we can assume another cache “miss” to access the typeHierarchy field.
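
This kind of byte counting is easy to get wrong by hand, so here’s a small C++ sketch that lets the compiler do it with offsetof. DemoClass and SpanFitsInCacheLine are hypothetical names invented for the example. DemoClass is a deliberately tiny stand-in and not the real Il2CppClass layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct with one field close to the start and one far away.
struct DemoClass
{
    std::uint64_t first;       // offset 0
    std::uint8_t nearby;       // offset 8
    std::uint8_t filler[100];  // offsets 9 through 108
    std::uint8_t distant;      // offset 109
};

// Returns true if the bytes from the start of the first field through the end
// of the last field fit within one cache line of the given size. If this is
// false, the two fields can never share a line no matter how the object is
// aligned. If it's true, they share a line only when the alignment cooperates.
constexpr bool SpanFitsInCacheLine(
    std::size_t firstOffset,
    std::size_t lastOffset,
    std::size_t lastSize,
    std::size_t lineSize)
{
    return (lastOffset + lastSize - firstOffset) <= lineSize;
}
```

Passing offsetof(DemoClass, first) and offsetof(DemoClass, nearby) with a 64-byte line size returns true, while swapping in offsetof(DemoClass, distant) returns false. That’s the same conclusion we reached by hand for typeHierarchy and typeHierarchyDepth.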

There’s also the actual indexing into the typeHierarchy array. It’s unclear where the array is, but it’s safe to assume another cache “miss” here as well. All in all we count the cost of this function at four or five cache misses plus three branches (if) and a shred of time for other work like potentially calling functions, subtracting one, and so forth. At about 100 nanoseconds per cache miss, we’re probably waiting on RAM for 400 to 500 nanoseconds. To compare, consider a 2 GHz ARM chip which can perform a 32-bit floating-point square root in 14 cycles or 7 nanoseconds. So in the worst case we spent enough time stalled to compute about 71 square roots. Casting is therefore not cheap in IL2CPP!
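
To see why the depth-indexed lookup in HasParentUnsafe works, here’s a simplified, self-contained C++ sketch of the same idea. SketchClass and HasParent are hypothetical names for this example. The sketch uses std::vector rather than IL2CPP’s raw arrays and it assumes each class stores its full chain of ancestors with the root first and itself last.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified stand-in for Il2CppClass.
struct SketchClass
{
    // Ancestors from the root down to this class (this class is last).
    std::vector<const SketchClass*> typeHierarchy;

    explicit SketchClass(const SketchClass* parentClass = nullptr)
    {
        if (parentClass != nullptr)
            typeHierarchy = parentClass->typeHierarchy;
        typeHierarchy.push_back(this);
    }

    // Copying would leave a stale "this" pointer in the hierarchy.
    SketchClass(const SketchClass&) = delete;
    SketchClass& operator=(const SketchClass&) = delete;

    std::size_t Depth() const { return typeHierarchy.size(); }
};

// Same test as HasParentUnsafe: if candidate is an ancestor of klass (or is
// klass itself) then it must sit at index (candidate depth - 1) in klass's
// hierarchy. One comparison and one array read, with no loop over parents.
bool HasParent(const SketchClass& klass, const SketchClass& candidate)
{
    return klass.Depth() >= candidate.Depth()
        && klass.typeHierarchy[candidate.Depth() - 1] == &candidate;
}
```

The lookup itself is constant-time, which is why the cost we counted above is dominated by where the data lives in memory rather than by the amount of work done.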

Now let’s look at the as operator cast:

// MyClass TestClass::AsCast(System.Object)
extern "C"  MyClass_t3388352440 * TestClass_AsCast_m23830982 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_AsCast_m23830982_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		RuntimeObject * L_0 = ___obj0;
		return ((MyClass_t3388352440 *)IsInstClass((RuntimeObject*)L_0, MyClass_t3388352440_il2cpp_TypeInfo_var));
	}
}

This looks exactly the same as the NormalCast version, except it calls straight to IsInstClass instead of first calling through CastclassClass. This means we skip one if, but it’s otherwise about the same cost as before. Since these are the only two casts in C#, the takeaway here is that they are both quite slow and should be avoided if at all possible. Optimize by avoiding casts in performance-critical areas of the code.
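
For readers more at home in C++, the behavioral difference between the two casts is roughly the difference between dynamic_cast on a reference and dynamic_cast on a pointer. The sketch below is only an analogy with hypothetical types (Base, MyClass, OtherClass). It isn’t what IL2CPP generates, and the explicit null test in NormalCast is there to mimic C#, where casting null succeeds and yields null.

```cpp
#include <typeinfo>

// Hypothetical polymorphic types for the analogy.
struct Base { virtual ~Base() = default; };
struct MyClass : Base {};
struct OtherClass : Base {};

// Like C#'s "obj as MyClass": yields null when the type doesn't match.
MyClass* AsCast(Base* obj)
{
    return dynamic_cast<MyClass*>(obj);
}

// Like C#'s "(MyClass)obj": null passes through, a mismatch throws.
MyClass* NormalCast(Base* obj)
{
    if (obj == nullptr)
        return nullptr;
    return &dynamic_cast<MyClass&>(*obj); // throws std::bad_cast on mismatch
}
```

Just as with CastclassClass and IsInstClass, both versions perform the same runtime type test. The only difference is what happens when it fails.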

Next up we’ll look at null checks using this trivial C# function:

static class TestClass
{
	static string NullCheck(object obj)
	{
		return obj.ToString();
	}
}

Now let’s look at the IL2CPP output:

// System.String TestClass::NullCheck(System.Object)
extern "C"  String_t* TestClass_NullCheck_m1438153993 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	{
		RuntimeObject * L_0 = ___obj0;
		NullCheck(L_0);
		String_t* L_1 = VirtFuncInvoker0< String_t* >::Invoke(3 /* System.String System.Object::ToString() */, L_0);
		return L_1;
	}
}

As we’re all aware, C# will throw a NullReferenceException whenever we access a member of a null object, such as a field or, as here, a method. To do this, the generated code must first check for null, and that check is right there before the call in the form of NullCheck. Let’s take a look at that:

inline void NullCheck(void* this_ptr)
{
    if (this_ptr != NULL)
        return;
 
    il2cpp::vm::Exception::RaiseNullReferenceException();
#if !IL2CPP_TARGET_IOS
    il2cpp_codegen_no_return();
#endif
}

This is exactly what we’d expect to see: a simple if followed by throwing an exception. The C++ compiler may or may not actually inline this function, but regardless there will be an if every time we access a member of the object. This branch’s overhead may be too much in some performance-critical areas, so we’d like to have a way to turn it off when we know that the object will never be null anyhow.

It turns out that Unity provides a way to turn off null checks in the IL2CPP output. All we need to do is copy Il2CppSetOptionAttribute.cs out of the Unity installation into our project’s Assets folder. For example, on macOS it can be found in /path/to/Unity.app/Contents/il2cpp. With that file in place, we now have access to an attribute that lets us turn off null checks on a function-by-function basis:

using Unity.IL2CPP.CompilerServices;
 
static class TestClass
{
	[Il2CppSetOption(Option.NullChecks, false)]
	static string NoNullCheck(object obj)
	{
		return obj.ToString();
	}
}

Now let’s look at the IL2CPP output to confirm that the null check has been removed:

// System.String TestClass::NoNullCheck(System.Object)
extern "C"  String_t* TestClass_NoNullCheck_m598864280 (RuntimeObject * __this /* static, unused */, RuntimeObject * ___obj0, const RuntimeMethod* method)
{
	{
		RuntimeObject * L_0 = ___obj0;
		String_t* L_1 = VirtFuncInvoker0< String_t* >::Invoke(3 /* System.String System.Object::ToString() */, L_0);
		return L_1;
	}
}

The function is now exactly the same except that the NullCheck function call is no longer generated by IL2CPP. This means we have an easy and effective way to opt out of null checks when we’re sure that we’ll never access a null object. This can provide a nice speedup. If the object does turn out to be null, we’ll get a crash rather than a NullReferenceException, which would probably be just as bad as a crash anyhow.
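
One way to use this safely is to check once at the boundary and then skip the checks in the hot path. Here’s a C++ sketch of that pattern that counts how many checks actually run. Counter, CheckNotNull, IncrementChecked, and IncrementCheckedOnce are hypothetical names for this illustration. In C# the inner loop would live in a method marked with the attribute shown above.

```cpp
#include <stdexcept>

// Hypothetical object whose member we access in a loop.
struct Counter
{
    int value = 0;
};

// Counts how many null checks have executed so the two versions can be compared.
static int g_nullChecks = 0;

inline void CheckNotNull(const void* ptr)
{
    ++g_nullChecks;
    if (ptr == nullptr)
        throw std::runtime_error("null reference");
}

// Models the default output: one check before every member access.
void IncrementChecked(Counter* counter, int times)
{
    for (int i = 0; i < times; ++i)
    {
        CheckNotNull(counter);
        ++counter->value;
    }
}

// Models checking once up front, then running the loop with checks disabled.
void IncrementCheckedOnce(Counter* counter, int times)
{
    CheckNotNull(counter);
    for (int i = 0; i < times; ++i)
    {
        ++counter->value;
    }
}
```

For five iterations the first version runs five checks and the second runs one, while both leave the counter with the same value. Whether the savings matter in practice depends on the code around the loop, so it’s worth profiling before and after.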

Now let’s see what else we can turn off! The Option enum has an ArrayBoundsChecks enumerator to turn off the bounds checks that result in an exception in the case where we try to access beyond the beginning or end of a C# array. Let’s try a couple of C# functions to see this in action.

static class TestClass
{
	static int BoundsCheck(int[] arr)
	{
		return arr[0];
	}
 
	[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
	static int NoBoundsCheck(int[] arr)
	{
		return arr[0];
	}
}

First let’s look at the default version that still has the bounds check:

// System.Int32 TestClass::BoundsCheck(System.Int32[])
extern "C"  int32_t TestClass_BoundsCheck_m2335954867 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___arr0, const RuntimeMethod* method)
{
	{
		Int32U5BU5D_t385246372* L_0 = ___arr0;
		NullCheck(L_0);
		int32_t L_1 = 0;
		int32_t L_2 = (L_0)->GetAt(static_cast<il2cpp_array_size_t>(L_1));
		return L_2;
	}
}

This function really doesn’t do anything except call GetAt to index into the array, so let’s look at that function:

inline int32_t GetAt(il2cpp_array_size_t index) const
{
	IL2CPP_ARRAY_BOUNDS_CHECK(index, (uint32_t)(this)->max_length);
	return m_Items[index];
}

This is a method of the array class (e.g. Int32U5BU5D_t385246372 for int[]), which derives from RuntimeArray, an alias for Il2CppArray that has the max_length field. That means m_Items comes right after max_length in memory, so it’s likely in the same cache line and will be a “hit.” Still, we have the IL2CPP_ARRAY_BOUNDS_CHECK macro to look at in order to see the actual bounds check:

// Performance optimization as detailed here: http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx
// Since array size is a signed int32_t, a single unsigned check can be performed to determine if index is less than array size.
// Negative indices will map to a unsigned number greater than or equal to 2^31 which is larger than allowed for a valid array.
#define IL2CPP_ARRAY_BOUNDS_CHECK(index, length) \
    do { \
        if (((uint32_t)(index)) >= ((uint32_t)length)) il2cpp::vm::Exception::Raise (il2cpp::vm::Exception::GetIndexOutOfRangeException()); \
    } while (0)

As the comment says, there’s only a need for one if here. The do-while loop is a common macro trick and should be removed by the compiler, so it won’t cost anything. That means all we’re really paying for here is the single if. Still, this can be expensive in performance-critical code just like the null checks were. So let’s look at what IL2CPP generates when we turn off the bounds check:
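
If the single unsigned comparison seems too good to be true, here’s a small C++ sketch that puts it next to the obvious two-comparison version. IsIndexInBounds and IsIndexInBoundsTwoChecks are hypothetical names for this example, and both assume the length is never negative, as is the case for array lengths.

```cpp
#include <cstdint>

// The trick from the macro: converting a negative index to unsigned turns it
// into a value of at least 2^31, which is larger than any valid length, so a
// single comparison rejects both negative and too-large indices.
constexpr bool IsIndexInBounds(std::int32_t index, std::int32_t length)
{
    return static_cast<std::uint32_t>(index) < static_cast<std::uint32_t>(length);
}

// The straightforward version with two comparisons, for reference.
constexpr bool IsIndexInBoundsTwoChecks(std::int32_t index, std::int32_t length)
{
    return index >= 0 && index < length;
}
```

Both functions agree for every index as long as the length is non-negative, including the edge cases of -1, the length itself, and the smallest possible int.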

// System.Int32 TestClass::NoBoundsCheck(System.Int32[])
extern "C"  int32_t TestClass_NoBoundsCheck_m3043388773 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___arr0, const RuntimeMethod* method)
{
	{
		Int32U5BU5D_t385246372* L_0 = ___arr0;
		NullCheck(L_0);
		int32_t L_1 = 0;
		int32_t L_2 = (L_0)->GetAtUnchecked(static_cast<il2cpp_array_size_t>(L_1));
		return L_2;
	}
}

GetAt has been replaced by GetAtUnchecked, so let’s look at it:

inline int32_t GetAtUnchecked(il2cpp_array_size_t index) const
{
	return m_Items[index];
}

This is just like GetAt except that the IL2CPP_ARRAY_BOUNDS_CHECK macro call has been removed. It’s exactly as we’d expect: a simple, effective way to get rid of array bounds checks. Unfortunately, there’s no way to put this attribute on .NET classes like List<T> or Dictionary<TKey, TValue> that internally use arrays. So we’ll have to stick with plain arrays or our own List-like types to access this optimization.
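
The usual way to make unchecked access safe is to validate the whole range once and then index freely inside the loop. Here’s a C++ sketch of that idea. IntArray and SumFirst are hypothetical names for this example. IntArray only loosely mirrors the generated array class by offering both a GetAt and a GetAtUnchecked.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical array wrapper with checked and unchecked element access.
class IntArray
{
public:
    explicit IntArray(std::vector<std::int32_t> items)
        : m_Items(std::move(items))
    {
    }

    std::uint32_t Length() const
    {
        return static_cast<std::uint32_t>(m_Items.size());
    }

    // Checked access: one comparison per element read.
    std::int32_t GetAt(std::uint32_t index) const
    {
        if (index >= Length())
            throw std::out_of_range("index out of range");
        return m_Items[index];
    }

    // Unchecked access: the caller promises the index is valid.
    std::int32_t GetAtUnchecked(std::uint32_t index) const
    {
        return m_Items[index];
    }

private:
    std::vector<std::int32_t> m_Items;
};

// Validates the range once, then reads every element without a bounds check.
std::int64_t SumFirst(const IntArray& array, std::uint32_t count)
{
    if (count > array.Length())
        throw std::out_of_range("count exceeds array length");

    std::int64_t sum = 0;
    for (std::uint32_t i = 0; i < count; ++i)
        sum += array.GetAtUnchecked(i);
    return sum;
}
```

The single check before the loop gives the same guarantee as a check on every iteration because the loop index never leaves the validated range. The same reasoning is what justifies putting the ArrayBoundsChecks option on a C# method.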

We should also be able to combine these two options to turn off both null and bounds checks.

static class TestClass
{
	[Il2CppSetOption(Option.NullChecks, false)]
	[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
	static int NoBoundsCheckNoNullCheck(int[] arr)
	{
		return arr[0];
	}
}

To confirm, let’s look at the IL2CPP output once again:

// System.Int32 TestClass::NoBoundsCheckNoNullCheck(System.Int32[])
extern "C"  int32_t TestClass_NoBoundsCheckNoNullCheck_m2913714251 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___arr0, const RuntimeMethod* method)
{
	{
		Int32U5BU5D_t385246372* L_0 = ___arr0;
		int32_t L_1 = 0;
		int32_t L_2 = (L_0)->GetAtUnchecked(static_cast<il2cpp_array_size_t>(L_1));
		return L_2;
	}
}

As expected, the NullCheck function call has been removed and GetAt has been replaced by GetAtUnchecked.

We’ve seen three optimizations we can make to our C# code to avoid costly work generated by IL2CPP. First, we saw that both kinds of casting are very expensive and should be avoided if at all possible in performance-critical code. Second, we saw how to turn off null checks for those times where we know that we’ll never access a null value or prefer to simply crash if we do. Lastly, we saw how to disable array bounds checks if we happen to know that our indices will always be valid. Combining these techniques can yield some serious speedups in performance “hot spots,” so keep them in mind next time you’re tuning your scripting performance.