Threads allow us to do more than one thing at a time using the CPU’s cores, but it turns out we can do more than one thing at a time using just a single core! Today’s article shows you how you can do this and the huge speed boost it can give you!

So far we’ve seen how tightly packing arrays of structs can maximize our usage of the CPU’s data cache, how running multiple threads can maximize our usage of the CPU’s cores, and how avoiding virtual functions and delegates can maximize our usage of the CPU’s instruction cache. Today we’ll see how to maximize our usage of yet another feature of modern CPUs: vector instructions.

Vector instructions are also known as SIMD: Single Instruction, Multiple Data. Instead of telling the CPU to multiply two numbers together, you tell it to multiply four pairs of numbers together with a single instruction. CPUs these days can typically do that operation about as fast as they can multiply a single pair.

           X    Y    Z    W
Input      1    2    3    4
Input     10   20   30   40
Output    10   40   90  160
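
To make the table concrete, here’s a minimal sketch in plain C of what a single four-wide multiply computes. It’s a scalar loop rather than a real vector instruction, and MultiplyLanes and LANE_COUNT are names made up for this illustration:

```c
#include <assert.h>
#include <stddef.h>

// Number of float lanes in a 128-bit vector register
#define LANE_COUNT 4

// Scalar reference for what one four-wide SIMD multiply computes:
// each lane of a is multiplied by the matching lane of b.
// (MultiplyLanes is a hypothetical name used only for this sketch.)
void MultiplyLanes(const float* a, const float* b, float* out)
{
	assert(a != NULL && b != NULL && out != NULL);
	for (size_t lane = 0; lane < LANE_COUNT; ++lane)
	{
		out[lane] = a[lane] * b[lane];
	}
}
```

Feeding it the two input rows from the table produces the output row. A vector instruction does the same work without the loop.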

Structs, again, are immensely helpful here. Unity’s Vector3 and Vector4 structs guarantee that x is immediately followed by y, which is in turn immediately followed by z and, in the case of Vector4, w. That contiguous, sequential layout is just what CPUs love: they’re designed to process these kinds of blocks of data.
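
On the C side, that layout assumption can be checked at compile time. Here’s a minimal sketch, assuming a C11-capable compiler and 4-byte floats. The Vector4 here is a C mirror of the Unity struct and CopyComponents is a hypothetical helper for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// C mirror of a four-float vector struct
typedef struct
{
	float x;
	float y;
	float z;
	float w;
} Vector4;

// Compile-time checks that the four components sit back to back
// with no padding between or after them
_Static_assert(offsetof(Vector4, x) == 0, "x must come first");
_Static_assert(offsetof(Vector4, y) == 1 * sizeof(float), "y must follow x");
_Static_assert(offsetof(Vector4, z) == 2 * sizeof(float), "z must follow y");
_Static_assert(offsetof(Vector4, w) == 3 * sizeof(float), "w must follow z");
_Static_assert(sizeof(Vector4) == 4 * sizeof(float), "no trailing padding");

// Copies all four components out as one contiguous block of floats,
// which is the same span of memory a 128-bit vector load would read.
// (CopyComponents is a hypothetical name used only for this sketch.)
void CopyComponents(const Vector4* v, float out[4])
{
	assert(v != NULL && out != NULL);
	memcpy(out, v, sizeof(Vector4));
}
```

If any of the checks failed, the build would stop rather than silently loading the wrong bytes into a vector register.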

The unfortunate part is that Unity doesn’t provide us an environment where we can easily write code to take advantage of SIMD. These CPU instructions tend to be specific to particular CPU sub-architectures. For example, x86 has MMX and SSE instruction sets and ARM has NEON. C# wants to abstract the CPU architecture, so it doesn’t provide any way for us to write code that calls these instructions. There is a Mono.Simd.dll that ships with Unity and you can include it in your project, but IL2CPP will ignore the special attributes it has on its functions and generate plain old C++ with no SIMD instructions.

So, in order to take advantage of SIMD in our Unity games we need to write some native code: C, C++, Objective-C, etc. This is relatively easy to do in limited circumstances where we need maximum performance. For example, on iOS you can just drop a .c file in your Assets/Plugins/iOS folder and use P/Invoke to call the C functions from C#.

For demonstration purposes, we’ll continue with the code that updates projectiles by applying a linear velocity to them. This happens to be a great application of SIMD because we’re doing two main vector operations: multiply each component of velocity by time and add each component of that to each component of position. On ARM, the UpdateProjectile function can be written like this:

// Include support for ARM's NEON "intrinsics"
// These are special functions that the compiler will
// literally turn into CPU instructions. It's like a
// C version of assembly code.
#include <arm_neon.h>
 
// Define C versions of the structs we use
typedef struct
{
	float x;
	float y;
	float z;
	float w;
} Vector4;
 
typedef struct
{
	Vector4 Position;
	Vector4 Velocity;
} ProjectileStructSimd;
 
// A C version of UpdateProjectile that uses SIMD instructions
// In this case, it uses ARM's NEON instruction set.
void UpdateProjectile(ProjectileStructSimd* projectile, float time)
{
	// Load position
	float32x4_t position = vld1q_f32(&projectile->Position.x);
 
	// Load velocity
	float32x4_t velocity = vld1q_f32(&projectile->Velocity.x);
 
	// Multiply each component of velocity by time and add the product to each
	// component of position to get the result
	float32x4_t result = vmlaq_n_f32(position, velocity, time);
 
	// Store the result in the position
	vst1q_f32(&projectile->Position.x, result);
}
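
Since these instructions differ per architecture, code that has to build for more than one CPU usually picks an implementation with the preprocessor. Here’s a minimal sketch of that idea, not something the test in this article uses: it takes the NEON path on ARM, an SSE path on x86, and falls back to plain C anywhere else. MultiplyAdd4 and the SKETCH_USE_* macros are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

// Pick an instruction set based on what the compiler says it targets
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
	#include <arm_neon.h>
	#define SKETCH_USE_NEON 1
#elif defined(__SSE__) || defined(_M_X64)
	#include <xmmintrin.h>
	#define SKETCH_USE_SSE 1
#endif

// Computes position[i] += velocity[i] * time for four consecutive floats.
// (MultiplyAdd4 is a hypothetical name used only for this sketch.)
void MultiplyAdd4(float* position, const float* velocity, float time)
{
	assert(position != NULL && velocity != NULL);
#if defined(SKETCH_USE_NEON)
	// ARM: multiply-accumulate by a scalar
	float32x4_t p = vld1q_f32(position);
	float32x4_t v = vld1q_f32(velocity);
	vst1q_f32(position, vmlaq_n_f32(p, v, time));
#elif defined(SKETCH_USE_SSE)
	// x86: broadcast time to all four lanes, multiply, then add.
	// The unaligned load/store variants avoid alignment requirements.
	__m128 p = _mm_loadu_ps(position);
	__m128 v = _mm_loadu_ps(velocity);
	__m128 t = _mm_set1_ps(time);
	_mm_storeu_ps(position, _mm_add_ps(p, _mm_mul_ps(v, t)));
#else
	// Fallback: the same math, one lane at a time
	for (int i = 0; i < 4; ++i)
	{
		position[i] += velocity[i] * time;
	}
#endif
}
```

Calling MultiplyAdd4(&projectile->Position.x, &projectile->Velocity.x, time) would then do the same job as the NEON-only function on whichever of these platforms the code is built for.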

As you’ll see later, there’s some overhead to calling into native code from C#. To minimize the number of calls into native code, we’ll add an additional C helper function that updates a subsection of the projectiles array. The threaded version will basically forward its work to this C function:

void UpdateProjectilesStructSimd(
	ProjectileStructSimd* firstProjectile,
	int count,
	float time
)
{
	// firstProjectile points at a contiguous run of structs, not at an
	// array of pointers, so index it directly
	int i;
	for (i = 0; i < count; ++i)
	{
		UpdateProjectile(&firstProjectile[i], time);
	}
}
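
To check a batch function like this without an ARM device handy, it can help to have a plain scalar version to compare against. Here’s a minimal sketch under that assumption: UpdateProjectilesReference is a hypothetical name, and it does the same multiply-and-add one component at a time in ordinary C that compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
	float x;
	float y;
	float z;
	float w;
} Vector4;

typedef struct
{
	Vector4 Position;
	Vector4 Velocity;
} ProjectileStructSimd;

// Scalar reference for the batch update: walks a contiguous run of
// structs and applies position += velocity * time to each component.
// (UpdateProjectilesReference is a hypothetical name for this sketch.)
void UpdateProjectilesReference(
	ProjectileStructSimd* firstProjectile,
	int count,
	float time
)
{
	assert(firstProjectile != NULL && count >= 0);
	for (int i = 0; i < count; ++i)
	{
		ProjectileStructSimd* projectile = &firstProjectile[i];
		projectile->Position.x += projectile->Velocity.x * time;
		projectile->Position.y += projectile->Velocity.y * time;
		projectile->Position.z += projectile->Velocity.z * time;
		projectile->Position.w += projectile->Velocity.w * time;
	}
}
```

Running both versions over copies of the same small array and comparing the positions (allowing for tiny rounding differences) is a cheap way to catch pointer or indexing mistakes before they turn into crashes on the device.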

With that in mind, let’s look at the full test script:

using System;
using System.Runtime.InteropServices;
using System.Threading;
 
using UnityEngine;
 
class TestScript : MonoBehaviour
{
	struct ProjectileStruct
	{
		public Vector3 Position;
		public Vector3 Velocity;
	}
 
	struct ProjectileStructSimd
	{
		public Vector4 Position;
		public Vector4 Velocity;
	}
 
	class ProjectileClass
	{
		public Vector3 Position;
		public Vector3 Velocity;
	}
 
	class ThreadStartParamStruct
	{
		public ProjectileStruct[] ProjectileStructs;
		public int StartIndex;
		public int Count;
		public float Time;
	}
 
	class ThreadStartParamStructSimd
	{
		public ProjectileStructSimd[] ProjectileStructSimds;
		public int StartIndex;
		public int Count;
		public float Time;
	}
 
	class ThreadStartParamClass
	{
		public ProjectileClass[] ProjectileClasses;
		public int StartIndex;
		public int Count;
		public float Time;
	}
 
	void Start()
	{
		// Setup
 
		const int count = 10000000;
		float time = 0.5f;
		ProjectileStruct[] projectileStructs = new ProjectileStruct[count];
		ProjectileStructSimd[] projectileStructSimds = new ProjectileStructSimd[count];
		ProjectileClass[] projectileClasses = new ProjectileClass[count];
		for (int i = 0; i < count; ++i)
		{
			projectileClasses[i] = new ProjectileClass();
		}
		Shuffle(projectileStructs);
		Shuffle(projectileStructSimds);
		Shuffle(projectileClasses);
		int numThreads = Environment.ProcessorCount;
		int numPerThread = count / numThreads;
		Thread[] structThreads = new Thread[numThreads];
		ParameterizedThreadStart threadStartStruct = UpdateProjectilesStruct;
		ThreadStartParamStruct threadStartParamStruct = new ThreadStartParamStruct();
		Thread[] structSimdThreads = new Thread[numThreads];
		ParameterizedThreadStart threadStartStructSimd = UpdateProjectilesStructSimd;
		ThreadStartParamStructSimd threadStartParamStructSimd = new ThreadStartParamStructSimd();
		Thread[] classThreads = new Thread[numThreads];
		ParameterizedThreadStart threadStartClass = UpdateProjectilesClass;
		ThreadStartParamClass threadStartParamClass = new ThreadStartParamClass();
		for (int i = 0; i < numThreads; ++i)
		{
			structThreads[i] = new Thread(threadStartStruct);
			structSimdThreads[i] = new Thread(threadStartStructSimd);
			classThreads[i] = new Thread(threadStartClass);
		}
 
		// Struct
 
		System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
		for (int i = 0; i < count; ++i)
		{
			UpdateProjectile(ref projectileStructs[i], time);
		}
		long structTime = sw.ElapsedMilliseconds;
 
		// Struct SIMD
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < count; ++i)
		{
			UpdateProjectile(ref projectileStructSimds[i], time);
		}
		long structSimdTime = sw.ElapsedMilliseconds;
 
		// Class
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < count; ++i)
		{
			UpdateProjectile(projectileClasses[i], time);
		}
		long classTime = sw.ElapsedMilliseconds;
 
		// Threaded Struct
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < numThreads; ++i)
		{
			// Give each thread its own parameter object. Sharing one object
			// and changing StartIndex after Start() would race with the
			// threads that are still reading it.
			ThreadStartParamStruct param = new ThreadStartParamStruct();
			param.ProjectileStructs = projectileStructs;
			param.StartIndex = i * numPerThread;
			param.Count = numPerThread;
			param.Time = time;
			structThreads[i].Start(param);
		}
		for (int i = 0; i < numThreads; ++i)
		{
			structThreads[i].Join();
		}
		long threadedStructTime = sw.ElapsedMilliseconds;
 
		// Threaded Struct SIMD
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < numThreads; ++i)
		{
			// Give each thread its own parameter object. Sharing one object
			// and changing StartIndex after Start() would race with the
			// threads that are still reading it.
			ThreadStartParamStructSimd param = new ThreadStartParamStructSimd();
			param.ProjectileStructSimds = projectileStructSimds;
			param.StartIndex = i * numPerThread;
			param.Count = numPerThread;
			param.Time = time;
			structSimdThreads[i].Start(param);
		}
		for (int i = 0; i < numThreads; ++i)
		{
			structSimdThreads[i].Join();
		}
		long threadedStructSimdTime = sw.ElapsedMilliseconds;
 
		// Threaded Class
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < numThreads; ++i)
		{
			// Give each thread its own parameter object. Sharing one object
			// and changing StartIndex after Start() would race with the
			// threads that are still reading it.
			ThreadStartParamClass param = new ThreadStartParamClass();
			param.ProjectileClasses = projectileClasses;
			param.StartIndex = i * numPerThread;
			param.Count = numPerThread;
			param.Time = time;
			classThreads[i].Start(param);
		}
		for (int i = 0; i < numThreads; ++i)
		{
			classThreads[i].Join();
		}
		long threadedClassTime = sw.ElapsedMilliseconds;
 
		string report = string.Format(
			"Type,Time,Threaded Time\n" +
			"Struct,{0},{1}\n" +
			"Struct SIMD,{2},{3}\n" +
			"Class,{4},{5}",
			structTime,
			threadedStructTime,
			structSimdTime,
			threadedStructSimdTime,
			classTime,
			threadedClassTime
		);
		Debug.Log(report);
	}
 
	static void UpdateProjectilesStruct(object startParam)
	{
		ThreadStartParamStruct typedStartParam = (ThreadStartParamStruct)startParam;
		ProjectileStruct[] projectileStructs = typedStartParam.ProjectileStructs;
		int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
		float time = typedStartParam.Time;
		for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
		{
			UpdateProjectile(ref projectileStructs[i], time);
		}
	}
 
	static void UpdateProjectilesStructSimd(object startParam)
	{
		ThreadStartParamStructSimd typedStartParam = (ThreadStartParamStructSimd)startParam;
		UpdateProjectilesStructSimd(
			ref typedStartParam.ProjectileStructSimds[typedStartParam.StartIndex],
			typedStartParam.Count,
			typedStartParam.Time
		);
	}
 
	[DllImport("jni-lib")]
	extern static void UpdateProjectilesStructSimd(
		ref ProjectileStructSimd firstProjectile,
		int count,
		float time
	);
 
	static void UpdateProjectilesClass(object startParam)
	{
		ThreadStartParamClass typedStartParam = (ThreadStartParamClass)startParam;
		ProjectileClass[] projectileClasses = typedStartParam.ProjectileClasses;
		int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
		float time = typedStartParam.Time;
		for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
		{
			UpdateProjectile(projectileClasses[i], time);
		}
	}
 
	static void UpdateProjectile(ref ProjectileStruct projectile, float time)
	{
		projectile.Position += projectile.Velocity * time;
	}
 
	[DllImport("jni-lib")]
	extern static void UpdateProjectile(ref ProjectileStructSimd projectile, float time);
 
	static void UpdateProjectile(ProjectileClass projectile, float time)
	{
		projectile.Position += projectile.Velocity * time;
	}
 
	public static void Shuffle<T>(T[] list)  
	{
		System.Random random = new System.Random();
		for (int n = list.Length; n > 1; )
		{
			n--;
			int k = random.Next(n + 1);
			T value = list[k];
			list[k] = list[n];
			list[n] = value;
		}
	}
}
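
One detail in the script above: numPerThread is count / numThreads, so when the count doesn’t divide evenly a few projectiles at the end are never handed to any thread (with 10,000,000 projectiles and 6 threads, 4 are left over). That hardly matters for a benchmark, but here’s a minimal sketch, in C to match the native code, of one way to hand the leftovers to the last thread. WorkRange and GetWorkRange are names made up for this illustration, not part of the test script.

```c
#include <assert.h>

// The slice of the array that one thread is responsible for
typedef struct
{
	int StartIndex;
	int Count;
} WorkRange;

// Splits count items across numThreads threads. Every thread gets
// count / numThreads items and the last thread also takes the remainder.
// (WorkRange and GetWorkRange are hypothetical names for this sketch.)
WorkRange GetWorkRange(int count, int numThreads, int threadIndex)
{
	assert(count >= 0);
	assert(numThreads > 0);
	assert(threadIndex >= 0 && threadIndex < numThreads);

	int numPerThread = count / numThreads;
	WorkRange range;
	range.StartIndex = threadIndex * numPerThread;
	range.Count = numPerThread;
	if (threadIndex == numThreads - 1)
	{
		// Everything from StartIndex to the end of the array
		range.Count = count - range.StartIndex;
	}
	return range;
}
```

With 10,000,000 items and 6 threads, the first five threads each get 1,666,666 items and the last one gets 1,666,670.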

If you want to try out the test yourself, simply paste the above code into a TestScript.cs file in your Unity project’s Assets directory and attach it to the main camera game object in a new, empty project. You’ll also need to compile the above C code into a shared library (i.e. a .so file), named to match the "jni-lib" in the DllImport attributes, and drop it in Assets/Plugins/Android. Then go to Player Settings and change Scripting Backend to IL2CPP. I ran it that way on this device:

  • LG Nexus 5X
  • Android 7.1.2
  • Unity 5.6.0f3

And here are the results I got:

Type           Time (ms)   Threaded Time (ms)
Struct               524                  185
Struct SIMD          798                   12
Class               4192                 1379

[Figure: Performance Graph]

You can see that the SIMD version of the struct is somewhat slower when single-threaded (798 milliseconds versus 524), which is likely due to the overhead of calling into native C code from C# once per projectile: ten million calls in all. The threaded version makes only one call per thread (6 threads on this CPU), so that overhead becomes negligible. Now we can see the performance increase from using SIMD instructions: the threaded struct time drops from 185 milliseconds to just 12! That’s about 15x faster, which is more than enough to tempt us into jumping through some hoops and adding native code to our games. Clearly you wouldn’t want to do this everywhere, but in performance-critical areas where SIMD instructions are a natural fit there is a big performance win to be had.

For some context, by using a combination of structs, threads, and SIMD we’re now running roughly 350x faster (4192 milliseconds down to 12) than the original version using classes, just the main thread, and non-SIMD C#. That’s truly a gigantic speedup! We’ve gone from an operation that took over four seconds and would require a long loading screen with no animation to something that we might be able to do during every frame (if we had a spare 12 milliseconds).
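
For anyone who wants to check the ratios against the results table, here’s a tiny sketch of the arithmetic. Speedup is a hypothetical helper name; the numbers plugged into it come straight from the table above.

```c
#include <assert.h>

// How many times faster the "after" time is than the "before" time.
// Both times must be in the same unit (milliseconds here).
// (Speedup is a hypothetical name used only for this sketch.)
double Speedup(double beforeMs, double afterMs)
{
	assert(afterMs > 0.0);
	return beforeMs / afterMs;
}
```

Speedup(185, 12) comes to about 15.4 for threaded structs versus threaded SIMD structs, and Speedup(4192, 12) comes to about 349.3 for the single-threaded class version versus threaded SIMD structs.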

Like the data and instruction caches and multiple cores, the CPU’s vector processing is yet another area of the machine’s total computing power that you can exploit. It’s not useful in all cases, but if you can manage to arrange your data in just the right way you can unlock a huge speed boost!