C# has some powerful features like fixed-size buffers, pointers, and unmanaged local variable arrays courtesy of stackalloc. These are deemed “unsafe” since they all deal with unmanaged memory. We should know what we’re ultimately instructing the CPU to execute when we use these features, so today we’ll take a look at the C++ output from IL2CPP and the assembly output from the C++ compiler to find out just that.

Dereferencing Pointers

Let’s start simple by looking at what happens when we dereference a pointer in C#:

static unsafe class TestClass
{
	static int DereferencePointer(int* x)
	{
		return *x;
	}
}

IL2CPP in Unity 2017.3 turns this C# into the following C++:

extern "C"  int32_t TestClass_DereferencePointer_m3073840890 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
{
	{
		int32_t* L_0 = ___x0;
		return (*((int32_t*)L_0));
	}
}

This is a pretty literal translation that isn’t adding much. There’s an unnecessary code block ({}), an unnecessary copy from the ___x0 parameter to the L_0 local variable and an unnecessary cast from int32_t* to int32_t*, but otherwise this function is just dereferencing the pointer as we did in C#.

Let’s see what Xcode 9.2’s C++ compiler turns this into when it generates ARM machine code:

ldr	r0, [r1]
bx	lr

All of that syntax just boils down to reading the memory at the pointer’s address (a.k.a. dereferencing it) then returning the value that was read. The pointer arrives in r1 rather than r0 because the unused __this parameter occupies the first argument register. This is the minimal assembly and neither IL2CPP nor the C++ compiler has added any overhead at all. Great!

Indexing Unmanaged Arrays

Pointers can also be thought of as the address of the first element of an array. Let’s try treating a pointer like an array with the p[x] syntax:

static unsafe class TestClass
{
	static int IndexPointer(int* x)
	{
		return x[3];
	}
}

Here’s the IL2CPP output:

extern "C"  int32_t TestClass_IndexPointer_m1288887649 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
{
	{
		int32_t* L_0 = ___x0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_0, (int32_t)((int32_t)12)))));
	}
}

Here we see that IL2CPP has multiplied the index (3) by the size of the int elements of the array (4) to get an offset in bytes: 12. Aside from a bunch of unnecessary casts and code blocks, we see a call to il2cpp_codegen_add which looks like this:

template<typename T, typename U>
inline typename pick_bigger<T, U>::type il2cpp_codegen_add(T left, U right)
{
	return left + right;
}

So this just adds the two parameters together and returns a pick_bigger<T, U>::type. That type is defined by the combination of three template structs using a C++ technique called template metaprogramming:

template<class T, class U>
struct pick_first<true, T, U>
{
	typedef T type;
};
 
template<class T, class U>
struct pick_first<false, T, U>
{
	typedef U type;
};
 
template<class T, class U>
struct pick_bigger
{
	typedef typename pick_first<(sizeof(T) >= sizeof(U)), T, U>::type type;
};

This boils down to pick_bigger<T, U>::type being the larger of the T and U types, with T winning when they’re the same size. So if il2cpp_codegen_add were to add an int32_t and an int64_t, it would return an int64_t since it’s bigger.
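To see the type selection in action, here is a minimal, self-contained sketch. The two pick_first specializations need a primary template before they will compile on their own, so the sketch declares one; that declaration and the name of its bool parameter are assumptions, since the excerpt above doesn't show them. The static_assert lines are only illustrations of the rule and check_pick_bigger is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Primary template, declared so the specializations below compile.
// (Assumed: the excerpt from the IL2CPP headers does not show it.)
template<bool PickFirst, class T, class U>
struct pick_first;

template<class T, class U>
struct pick_first<true, T, U>
{
    typedef T type;
};

template<class T, class U>
struct pick_first<false, T, U>
{
    typedef U type;
};

template<class T, class U>
struct pick_bigger
{
    typedef typename pick_first<(sizeof(T) >= sizeof(U)), T, U>::type type;
};

template<typename T, typename U>
inline typename pick_bigger<T, U>::type il2cpp_codegen_add(T left, U right)
{
    return left + right;
}

// Mixed widths: the wider type wins regardless of argument order.
static_assert(std::is_same<pick_bigger<int32_t, int64_t>::type, int64_t>::value, "");
static_assert(std::is_same<pick_bigger<int64_t, int32_t>::type, int64_t>::value, "");
// Equal widths: the >= comparison means the first type wins the tie.
static_assert(std::is_same<pick_bigger<int32_t, uint32_t>::type, int32_t>::value, "");

// Hypothetical helper: adding a 32-bit and a 64-bit value yields 64 bits.
void check_pick_bigger()
{
    assert(il2cpp_codegen_add((int32_t)1, (int64_t)2) == 3);
    assert(sizeof(il2cpp_codegen_add((int32_t)1, (int64_t)2)) == 8);
}
```

All of the selection happens at compile time, which is why none of it shows up in the machine code.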

Now let’s see how all this metaprogramming and extra syntax boils down to ARM assembly code:

ldr	r0, [r1, #12]
bx	lr

All of that got stripped out and we’re left with just one read from 12 bytes after the pointer and then returning the value that was read. Again, this is the minimal work so the output is great!
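As a rough illustration of why the compiler can collapse this so easily, the sketch below writes the generated byte-offset arithmetic next to plain indexing. The function names are hypothetical and the code assumes a 4-byte int32_t, as in the article.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the generated code: add a byte offset to the pointer's address,
// then treat the result as an int32_t* and dereference it.
inline int32_t index_by_byte_offset(int32_t* x)
{
    intptr_t address = (intptr_t)x;
    return *(int32_t*)(address + 12); // 12 == 3 * sizeof(int32_t)
}

// The plain form that the C# source expresses.
inline int32_t index_directly(int32_t* x)
{
    return x[3];
}

// Hypothetical helper: both forms read the fourth element.
void check_indexing()
{
    int32_t values[4] = { 10, 20, 30, 40 };
    assert(index_by_byte_offset(values) == 40);
    assert(index_directly(values) == 40);
}
```

Both functions describe the same load, so it's reasonable that they end up as the same single ldr with an immediate offset.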

Casting Pointers

Next up, let’s cast a pointer to one type to a pointer to another type. We’ve seen before that casting objects can be quite expensive, but does this hold true for pointers? Let’s try:

static unsafe class TestClass
{
	static float CastPointer(int* x)
	{
		return *(float*)x;
	}
}

Here’s what IL2CPP outputs:

extern "C"  float TestClass_CastPointer_m222125499 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
{
	{
		int32_t* L_0 = ___x0;
		return (*((float*)L_0));
	}
}

Aside from the pointless local variable and code block, this is a literal translation of the C#. There’s no dynamic type checking going on here, unlike with object casting. Let’s see how this gets compiled into ARM assembly:

	ldr	r0, [r1]
	bx	lr

This is exactly the same assembly as with just dereferencing the pointer without a cast, as it should be. We simply read from memory at the pointer’s address then return the value. Great!
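It's worth being clear that a pointer cast reinterprets the bits rather than converting the value. Here is a small sketch of the difference, assuming a 32-bit IEEE 754 float. The generated code dereferences the cast pointer directly; this sketch uses std::memcpy instead because standard C++ aliasing rules don't formally allow reading an int32_t object through a float*. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads the four bytes at x as a float. No numeric conversion takes place.
inline float read_bits_as_float(const int32_t* x)
{
    static_assert(sizeof(float) == sizeof(int32_t), "sketch assumes a 32-bit float");
    float result;
    std::memcpy(&result, x, sizeof(result));
    return result;
}

// Hypothetical helper contrasting reinterpretation with conversion.
void check_cast()
{
    int32_t bits = 0x3F800000; // IEEE 754 single-precision encoding of 1.0f
    assert(read_bits_as_float(&bits) == 1.0f);  // reinterpretation of the bits
    assert((float)bits == 1065353216.0f);       // value conversion, for contrast
}
```

Since nothing is converted, there's nothing for the CPU to do beyond the load itself.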

Fixing a Pointer to a Managed Array

The memory manager in C# is allowed to move objects around as it pleases. This is possible since C# references aren’t a literal memory address, so no pointers will be invalidated by the move. This also implies that if we want a pointer to a managed object, we need to use a fixed block to temporarily prevent the object from being moved while we use its current location in memory. Let’s try that by getting a pointer to the first element of a managed array:

static unsafe class TestClass
{
	static int* FixedBlock(int[] x)
	{
		fixed (int* p = x)
		{
			return p;
		}
	}
}

The following is the C++ that IL2CPP outputs. It’s a lot longer, so I’ve annotated it with comments to explain what’s going on.

extern "C"  int32_t* TestClass_FixedBlock_m2981938384 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___x0, const RuntimeMethod* method)
{
	// A temporary that's used later
	int32_t* V_0 = NULL;
 
	// This will be the return value
	uintptr_t G_B4_0 = 0;
 
	{
		// If the array is null, skip to assigning the return value to null
		Int32U5BU5D_t385246372* L_0 = ___x0;
		if (!L_0)
		{
			goto IL_000e;
		}
	}
	{
		// The array is not null because we didn't execute the above 'goto'
		// Null check the array (redundantly)
		Int32U5BU5D_t385246372* L_1 = ___x0;
		NullCheck(L_1);
 
		// If the array is not empty, skip to getting the first element's address
		if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))
		{
			goto IL_0015;
		}
	}
 
IL_000e:
	{
		// Set null and then go return it
		G_B4_0 = (((uintptr_t)0));
		goto IL_001c;
	}
 
IL_0015:
	{
		// The array is not null because two null checks have passed and we also dereferenced it
		// Null check the array (for the third time)
		Int32U5BU5D_t385246372* L_2 = ___x0;
		NullCheck(L_2);
 
		// Set the return value to the address of the first element
		G_B4_0 = ((uintptr_t)(((L_2)->GetAddressAt(static_cast<il2cpp_array_size_t>(0)))));
	}
 
IL_001c:
	{
		// Cast to int* and return
		V_0 = (int32_t*)G_B4_0;
		int32_t* L_3 = V_0;
		return (int32_t*)(L_3);
	}
}

GetAddressAt and its helper macro look like this:

inline int32_t* GetAddressAt(il2cpp_array_size_t index)
{
	IL2CPP_ARRAY_BOUNDS_CHECK(index, (uint32_t)(this)->max_length);
	return m_Items + index;
}
 
// Performance optimization as detailed here: http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx
// Since array size is a signed int32_t, a single unsigned check can be performed to determine if index is less than array size.
// Negative indices will map to a unsigned number greater than or equal to 2^31 which is larger than allowed for a valid array.
#define IL2CPP_ARRAY_BOUNDS_CHECK(index, length) \
    do { \
        if (((uint32_t)(index)) >= ((uint32_t)length)) il2cpp::vm::Exception::Raise (il2cpp::vm::Exception::GetIndexOutOfRangeException()); \
    } while (0)
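The trick described in the macro's comment can be sketched on its own. This is just an illustration with hypothetical function names: it compares the obvious two-comparison bounds check against the single unsigned comparison for a handful of indices, including negative ones.

```cpp
#include <cassert>
#include <cstdint>

// Two-comparison form: what the check means.
inline bool in_bounds_signed(int32_t index, int32_t length)
{
    return index >= 0 && index < length;
}

// Single-comparison form, as in IL2CPP_ARRAY_BOUNDS_CHECK: a negative index
// converts to an unsigned value of at least 2^31, which is never less than
// a non-negative int32_t length.
inline bool in_bounds_unsigned(int32_t index, int32_t length)
{
    return (uint32_t)index < (uint32_t)length;
}

// Hypothetical helper: the two forms agree for every sample index.
void check_bounds_forms()
{
    const int32_t length = 5;
    const int32_t samples[] = { -2147483647 - 1, -1, 0, 4, 5, 2147483647 };
    for (int32_t index : samples)
        assert(in_bounds_signed(index, length) == in_bounds_unsigned(index, length));
}
```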

So there are some redundant checks. We checked for a null array three times and a valid index twice. There are also a lot of extra code blocks and pointless local variable copies, but that hasn’t tripped up the C++ compiler so far. Let’s see what ARM assembly it generates in this case. I’ve annotated it to explain what’s going on.

	cbz	r1, LBB3_2      // if array is null, go to LBB3_2
	ldr	r0, [r1, #12]   // read array's max_length
	cmp	r0, #0          // compare max_length with 0
	it	ne              // if max_length isn't zero
	addne.w	r0, r1, #16 // set the first element's address as return value if max_length isn't zero
	bx	lr              // return
LBB3_2:
	movs	r0, #0      // set null as return value
	bx	lr              // return

The compiler has removed the second and third null checks and the redundant bounds check, leaving just one null check and one empty-array check. These are sensible checks for general-purpose code, but what if we know the array isn’t null and isn’t empty? Can we remove these checks with [Il2CppSetOption] attributes? Let’s try:

static unsafe class TestClass
{
	[Il2CppSetOption(Option.NullChecks, false)]
	[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
	static int* FixedBlockNoNullChecksNoRangeChecks(int[] x)
	{
		fixed (int* p = x)
		{
			return p;
		}
	}
}

Here’s the C++ we get out of IL2CPP:

extern "C"  int32_t* TestClass_FixedBlockNoNullChecksNoRangeChecks_m1091582187 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___x0, const RuntimeMethod* method)
{
	int32_t* V_0 = NULL;
	uintptr_t G_B4_0 = 0;
	{
		Int32U5BU5D_t385246372* L_0 = ___x0;
		if (!L_0)
		{
			goto IL_000e;
		}
	}
	{
		Int32U5BU5D_t385246372* L_1 = ___x0;
		if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))
		{
			goto IL_0015;
		}
	}
 
IL_000e:
	{
		G_B4_0 = (((uintptr_t)0));
		goto IL_001c;
	}
 
IL_0015:
	{
		Int32U5BU5D_t385246372* L_2 = ___x0;
		G_B4_0 = ((uintptr_t)(((L_2)->GetAddressAtUnchecked(static_cast<il2cpp_array_size_t>(0)))));
	}
 
IL_001c:
	{
		V_0 = (int32_t*)G_B4_0;
		int32_t* L_3 = V_0;
		return (int32_t*)(L_3);
	}
}

Both NullCheck calls have been removed and the call to GetAddressAt has been converted to GetAddressAtUnchecked, as we’d expect from [Il2CppSetOption]. However, we still have a manual null check (if (!L_0)) and a manual length check (if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))) that didn’t get removed. How will this affect the ARM assembly? Let’s see:

	cbz	r1, LBB6_2
	ldr	r0, [r1, #12]
	cmp	r0, #0
	it	ne
	addne.w	r0, r1, #16
	bx	lr
LBB6_2:
	movs	r0, #0
	bx	lr

This assembly code is exactly the same as without the [Il2CppSetOption] attributes. The C++ compiler was already stripping out the redundant null and bounds checks, so adding attributes didn’t change the code the CPU will execute. It’s unfortunate that we can’t get rid of the two remaining checks, but at least we don’t need to remember to add attributes to get the best machine code output.
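The two surviving checks make sense once we remember that they are part of what the fixed statement means: for an array expression, C# defines the pointer to be null when the array is null or has zero elements. They're language semantics rather than safety checks, so the attributes have nothing to switch off. Here is a rough model of that behavior. IntArrayModel is a hypothetical stand-in, not the real IL2CPP array layout, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an IL2CPP int[] object; the real layout differs.
struct IntArrayModel
{
    int32_t max_length;
    int32_t m_Items[4];
};

// What `fixed (int* p = x)` evaluates to for an array expression:
// null for a null or empty array, otherwise the address of element 0.
inline int32_t* fixed_array_pointer(IntArrayModel* array)
{
    if (array == nullptr)
        return nullptr;
    if (array->max_length == 0)
        return nullptr;
    return &array->m_Items[0];
}

// Hypothetical helper covering the three cases.
void check_fixed_array_pointer()
{
    IntArrayModel empty = { 0, { 0, 0, 0, 0 } };
    IntArrayModel filled = { 4, { 7, 8, 9, 10 } };
    assert(fixed_array_pointer(nullptr) == nullptr);
    assert(fixed_array_pointer(&empty) == nullptr);
    assert(*fixed_array_pointer(&filled) == 7);
}
```

This has the same shape as the assembly above: one test for null, one test of the length, and an address computation.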

Fixing a Pointer to a Fixed Buffer

There’s another meaning of the fixed keyword. It can be used to create a fixed-length buffer of primitives as a direct field of a struct. So fixed float pos[3] is just like typing float x, y, z: the values are added directly to the struct’s contents. Unlike individual values, fixed-length buffers can be indexed into like a managed array. To do so, we once again need to use the other fixed keyword to fix a pointer since the managed object containing the struct might be moved by the memory manager. Let’s try that now:

unsafe struct TestStruct
{
	public fixed int FixedBuffer[10];
 
	public int UseFixedBuffer()
	{
		fixed (int* f = FixedBuffer)
		{
			return f[3];
		}
	}
}

Here’s the C++ that IL2CPP outputs:

extern "C"  int32_t TestStruct_UseFixedBuffer_m1456987545 (TestStruct_t512363622 * __this, const RuntimeMethod* method)
{
	int32_t* V_0 = NULL;
	{
		U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_0 = __this->get_address_of_FixedBuffer_0();
		int32_t* L_1 = L_0->get_address_of_FixedElementField_0();
		V_0 = (int32_t*)L_1;
		int32_t* L_2 = V_0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_2, (int32_t)((int32_t)12)))));
	}
}
 
struct  U3CFixedBufferU3E__FixedBuffer0_t1481979028 
{
public:
	union
	{
		struct
		{
			// System.Int32 TestStruct/<FixedBuffer>__FixedBuffer0::FixedElementField
			int32_t ___FixedElementField_0;
		};
		uint8_t U3CFixedBufferU3E__FixedBuffer0_t1481979028__padding[40];
	};
 
public:
	inline static int32_t get_offset_of_FixedElementField_0() { return static_cast<int32_t>(offsetof(U3CFixedBufferU3E__FixedBuffer0_t1481979028, ___FixedElementField_0)); }
	inline int32_t get_FixedElementField_0() const { return ___FixedElementField_0; }
	inline int32_t* get_address_of_FixedElementField_0() { return &___FixedElementField_0; }
	inline void set_FixedElementField_0(int32_t value)
	{
		___FixedElementField_0 = value;
	}
};

We can see that the generated struct has 40 bytes (uint8_t) of data, which is the capacity for 10 int values in the fixed-length buffer. Various accessor functions were generated, and a couple of them are used in UseFixedBuffer. il2cpp_codegen_add returns here to offset the pointer to the first element by 12, which is 3 elements of 4 bytes each.

Notably missing are the null and length checks. Because fixed-length buffers cannot be null and cannot be empty, there’s simply no reason to check for these cases. Let’s see how this translates to assembly code via the C++ compiler:

	ldr	r0, [r0, #12]
	bx	lr

This is identical to the assembly code that was generated when we indexed into a pointer. There are no null checks, no range checks, and no length checks as we saw above with managed arrays. Great!
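To make the layout concrete, here is a hypothetical C++ analogue of TestStruct. It isn't the struct IL2CPP generates (that one uses a union with padding, as shown above), but it should have the same size and element offsets under the article's assumption of 4-byte ints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical analogue of TestStruct: the ten ints live inline in the
// struct, so there is no separate array object, length field, or reference.
struct TestStructModel
{
    int32_t FixedBuffer[10];
};

static_assert(sizeof(TestStructModel) == 40, "10 elements * 4 bytes each");
static_assert(offsetof(TestStructModel, FixedBuffer) == 0, "buffer starts at the struct's address");

// Equivalent of UseFixedBuffer: element 3 sits 12 bytes past `self`.
inline int32_t use_fixed_buffer(TestStructModel* self)
{
    return self->FixedBuffer[3];
}

// Hypothetical helper: checks the value read and the byte offset.
void check_fixed_buffer()
{
    TestStructModel s = { { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } };
    assert(use_fixed_buffer(&s) == 3);
    assert((char*)&s.FixedBuffer[3] - (char*)&s == 12);
}
```

With no length stored anywhere and no separate object that could be null, there's nothing to check before the load.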

Now let’s change up how we use the fixed buffer. Instead of using it from within the struct, let’s use it from another class.

static unsafe class TestClass
{
	static int FixedBufferFromPointer(TestStruct* x)
	{
		fixed (int* p = x->FixedBuffer)
		{
			return p[3];
		}
	}
}

This is the same code, except it uses a pointer parameter instead of the this pointer to access the fixed-length buffer. Let’s see the C++:

extern "C"  int32_t TestClass_FixedBufferFromPointer_m2376647269 (RuntimeObject * __this /* static, unused */, TestStruct_t512363622 * ___x0, const RuntimeMethod* method)
{
	int32_t* V_0 = NULL;
	{
		TestStruct_t512363622 * L_0 = ___x0;
		NullCheck(L_0);
		U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_1 = L_0->get_address_of_FixedBuffer_0();
		int32_t* L_2 = L_1->get_address_of_FixedElementField_0();
		V_0 = (int32_t*)L_2;
		int32_t* L_3 = V_0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_3, (int32_t)((int32_t)12)))));
	}
}

We get a null check now, but otherwise the code is the same as when the function was within the struct. Let’s see the assembly code:

	push	{r4, r7, lr}
	add	r7, sp, #4
	mov	r4, r1
	cbnz	r4, LBB9_2
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB9_2:
	ldr	r0, [r4, #12]
	pop	{r4, r7, pc}

Most of this is for the null check, so let’s remove it to see if we can get back to the minimal assembly that was generated for the function inside the struct:

static unsafe class TestClass
{
	[Il2CppSetOption(Option.NullChecks, false)]
	static int FixedBufferFromPointerNoNullChecks(TestStruct* x)
	{
		fixed (int* p = x->FixedBuffer)
		{
			return p[3];
		}
	}
}

Here’s the C++ that IL2CPP outputs:

extern "C"  int32_t TestClass_FixedBufferFromPointerNoNullChecks_m1201836921 (RuntimeObject * __this /* static, unused */, TestStruct_t512363622 * ___x0, const RuntimeMethod* method)
{
	int32_t* V_0 = NULL;
	{
		TestStruct_t512363622 * L_0 = ___x0;
		U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_1 = L_0->get_address_of_FixedBuffer_0();
		int32_t* L_2 = L_1->get_address_of_FixedElementField_0();
		V_0 = (int32_t*)L_2;
		int32_t* L_3 = V_0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_3, (int32_t)((int32_t)12)))));
	}
}

This is the same, except the NullCheck call has been removed. Now let’s see what this compiles to:

ldr	r0, [r1, #12]
bx	lr

We’re back to the original, minimal assembly as with the function inside the struct. Great!

Stack Allocating Arrays

Finally for today, let’s look at the stackalloc keyword. Like a fixed-length buffer, it creates an unmanaged array, but as a local variable inside a function.

static unsafe class TestClass
{
	static int StackallocVar(int len)
	{
		int* x = stackalloc int[len];
		return x[3];
	}
}

IL2CPP outputs this C++, which I’ve annotated with comments:

extern "C"  int32_t TestClass_StackallocVar_m2549670191 (RuntimeObject * __this /* static, unused */, int32_t ___len0, const RuntimeMethod* method)
{
	int32_t* V_0 = NULL;
	{
		int32_t L_0 = ___len0;
 
		// If the array's size in bytes doesn't fit in 32 bits, throw an exception
		if ((uint64_t)(uint32_t)L_0 * (uint64_t)(uint32_t)4 > (uint64_t)(uint32_t)kIl2CppUInt32Max)
			IL2CPP_RAISE_MANAGED_EXCEPTION(il2cpp_codegen_get_overflow_exception());
 
		// stackalloc the array and clear it to zeroes
		int8_t* L_1 = (int8_t*) alloca(((int32_t)il2cpp_codegen_multiply((int32_t)L_0, (int32_t)4)));
		memset(L_1,0,((int32_t)il2cpp_codegen_multiply((int32_t)L_0, (int32_t)4)));
 
		// Index into the fourth element and return it
		V_0 = (int32_t*)(L_1);
		int32_t* L_2 = V_0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_2, (int32_t)((int32_t)12)))));
	}
}

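The function begins with an overflow check: it throws if the array’s size in bytes (len * 4) doesn’t fit in 32 bits, which also catches negative lengths since they’re treated as unsigned. This guards the size calculation rather than the stack itself, so a length that passes can still be far larger than the stack. Then it allocates the array, clears it to zeroes, and returns the fourth element. Let’s see how this compiles into ARM assembly: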

	push	{r4, r5, r7, lr}
	add	r7, sp, #8
	sub	sp, #4
	movw	r0, :lower16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_0+4))
	cmp.w	r1, #1073741824
	movt	r0, :upper16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_0+4))
LPC12_0:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	str	r0, [r7, #-12]
	bhs	LBB12_2
	sub.w	r4, sp, r1, lsl #2
	mov	sp, r4
	lsls	r2, r1, #2
	mov	r0, r4
	movs	r1, #0
	bl	_memset
	ldr	r0, [r4, #12]
	ldr	r1, [r7, #-12]
	movw	r2, :lower16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_1+4))
	movt	r2, :upper16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_1+4))
LPC12_1:
	add	r2, pc
	ldr	r2, [r2]
	ldr	r2, [r2]
	subs	r1, r2, r1
	ittt	eq
	subeq.w	r4, r7, #8
	moveq	sp, r4
	popeq	{r4, r5, r7, pc}
	bl	___stack_chk_fail
LBB12_2:
	bl	__Z37il2cpp_codegen_get_overflow_exceptionv
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException

Most of this is the code for the call to alloca, which makes room on the stack for the array, the length check and ensuing exception, the memset to clear the array to zeroes, and the stack-protector check (___stack_chk_guard and ___stack_chk_fail) that the C++ compiler added. There’s definitely overhead to be paid here, but it’s a lot less than the overhead of allocating a managed array and, ultimately, feeding it to the GC.
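The length check is a nice example of the compiler simplifying what IL2CPP emits. The sketch below puts the 64-bit form from the generated C++ next to the single comparison that the cmp.w r1, #1073741824 and bhs instructions perform. It assumes kIl2CppUInt32Max equals UINT32_MAX, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The check as IL2CPP emits it: widen to 64 bits, multiply by the element
// size, and compare against the largest 32-bit unsigned value.
inline bool byte_size_overflows(int32_t len)
{
    return (uint64_t)(uint32_t)len * (uint64_t)4 > (uint64_t)UINT32_MAX;
}

// The same test reduced to one unsigned comparison against 2^30,
// matching `cmp.w r1, #1073741824` followed by `bhs`.
inline bool byte_size_overflows_reduced(int32_t len)
{
    return (uint32_t)len >= 1073741824u;
}

// Hypothetical helper: the two forms agree around the boundary and for
// negative lengths, which become huge once treated as unsigned.
void check_overflow_forms()
{
    const int32_t samples[] = { 0, 1, 1073741823, 1073741824, 2147483647, -1 };
    for (int32_t len : samples)
        assert(byte_size_overflows(len) == byte_size_overflows_reduced(len));
}
```

Note that a length just under 2^30 still passes even though it asks for nearly 4 GB, so this check shouldn't be relied on to keep a stackalloc within the stack's limits.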

If the length of the array is constant, there’s a workaround to eliminate the alloca and length checks. Just create a bunch of local variables and use the address of the first one as a pointer to the first element of the “array”:

static unsafe class TestClass
{
	static int StackallocManual(int i)
	{
		int x0, x1, x2, x3, x4, x5, x6, x7, x8, x9;
		int* x = &x0;
		return x[i];
	}
}

Here’s the C++ that IL2CPP generates:

extern "C"  int32_t TestClass_StackallocManual_m2866866183 (RuntimeObject * __this /* static, unused */, int32_t ___i0, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	int32_t V_2 = 0;
	int32_t V_3 = 0;
	int32_t V_4 = 0;
	int32_t V_5 = 0;
	int32_t V_6 = 0;
	int32_t V_7 = 0;
	int32_t V_8 = 0;
	int32_t V_9 = 0;
	int32_t* V_10 = NULL;
	{
		V_10 = (int32_t*)(&V_0);
		int32_t* L_0 = V_10;
		int32_t L_1 = ___i0;
		return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_0, (intptr_t)((intptr_t)il2cpp_codegen_multiply((intptr_t)(((intptr_t)L_1)), (int32_t)4))))));
	}
}

There’s no more length check or alloca call here. Let’s look at the assembly to see what ultimately runs on the CPU:

sub	sp, #4
movs	r0, #0
str	r0, [sp]
mov	r0, sp
ldr.w	r0, [r0, r1, lsl #2]
add	sp, #4
bx	lr

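This assembly is way shorter and no longer has any branch instructions such as for length checks. Notice, though, that only four bytes of stack are reserved (sub sp, #4) and only x0 is zeroed: the compiler dropped the nine unused locals, so x[i] for any i other than 0 reads memory outside the “array”. Neither C# nor C++ guarantees that separate locals are laid out contiguously, so treat this trick with caution.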

Conclusion

All of these “unsafe” language features are implemented reasonably well in both IL2CPP and the C++ compiler. In many cases, the resulting ARM assembly code is completely optimal. In particular, dereferencing pointers, indexing into an unmanaged array, and casting pointers all produced the minimal assembly code in these tests.

Indexing into fixed-length buffer fields of a struct usually produces optimal machine code, but null checks will be added if accessing the buffer via a pointer. Thankfully, they’re easily removed by adding [Il2CppSetOption(Option.NullChecks, false)] to the function using the buffer.

Unlike fixed-length buffers, getting a pointer to a managed array with a fixed block requires two checks: one for null and one for empty arrays. These can’t be removed with an attribute.

The last form of array is created by stackalloc and it comes with the most overhead. There’s a length check, a call to memset to clear to zeroes, and alloca to dynamically allocate inside the stack. Working around this requires a fixed array length and some extra typing, but it’s an option when performance is really crucial.

All of these are as fast as or faster than their managed equivalents. They typically involve fewer cache misses due to pointer indirection, less need to disable null and bounds checks, and zero interaction with the GC and memory manager. For performance-critical C#, there’s a lot of upside to using “unsafe” code.