Unity 2019.1 was released last week and the Burst compiler is now out of Preview. It promises better performance by generating faster machine code than IL2CPP does. Let’s try it out and see if the performance lives up to the hype!

To use the Burst compiler, we need to do three main things. First, we must restrict our usage of C# to the “High-Performance C#” subset. This means we don’t use any managed objects such as classes and strings. Second, we must put all our code in a C# job such as one implementing IJob or IJobParallelFor. Third, we must add the [BurstCompile] attribute to the job. If we already did the first two steps, adding the [BurstCompile] attribute is trivial enough to be considered “free” in terms of cost to implement.
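As a rough illustration of these three steps, here is a minimal sketch of a parallel job. The names ScaleJob, Input, Output, and Factor are hypothetical and exist only for this example, and the sketch assumes the Burst, Collections, Jobs, and Mathematics packages are installed in the project.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Step 3: opt the job in to Burst compilation
[BurstCompile]
// Step 2: the code lives in a job struct (here the parallel variant)
struct ScaleJob : IJobParallelFor // hypothetical job name
{
    // Step 1: only unmanaged data, no classes or strings
    [ReadOnly] public NativeArray<float4> Input;
    public NativeArray<float4> Output;
    public float Factor;

    // Called once per index, potentially on several worker threads
    public void Execute(int index)
    {
        Output[index] = Input[index] * Factor;
    }
}
```

A job like this could be scheduled with something like new ScaleJob { Input = input, Output = output, Factor = 2 }.Schedule(input.Length, 64).Complete(), where 64 is an arbitrary batch size picked for illustration.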

So let’s write a simple job and test its performance with and without the [BurstCompile] attribute. All this job does is compute the dot product of each pair of vectors from two arrays and store the result in a third. We’ll use the NativeArray<T> type to hold the arrays and the float4 type from the newly-released Unity.Mathematics package to hold the vectors. Burst is aware of the float4 type and can generate more optimal code when we use it. Here’s the test:

using System.Diagnostics;
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;
 
class TestScript : MonoBehaviour
{
    // Baseline job: no [BurstCompile], so it is compiled by IL2CPP in this build
    struct RegularJob : IJob
    {
        public NativeArray<float4> A;
        public NativeArray<float4> B;
        public NativeArray<float4> C;
 
        public void Execute()
        {
            for (int i = 0; i < A.Length; ++i)
            {
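                // math.dot returns a float; assigning it to a float4
                // copies that value into all four components
                C[i] = math.dot(A[i], B[i]);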
            }
        }
    }
 
    // Identical job body, but compiled by Burst
    [BurstCompile]
    struct BurstJob : IJob
    {
        public NativeArray<float4> A;
        public NativeArray<float4> B;
        public NativeArray<float4> C;
 
        public void Execute()
        {
            for (int i = 0; i < A.Length; ++i)
            {
                C[i] = math.dot(A[i], B[i]);
            }
        }
    }
 
    void Start()
    {
        const int size = 1000000;
        NativeArray<float4> a = new NativeArray<float4>(size, Allocator.TempJob);
        NativeArray<float4> b = new NativeArray<float4>(size, Allocator.TempJob);
        NativeArray<float4> c = new NativeArray<float4>(size, Allocator.TempJob);
        for (int i = 0; i < size; ++i)
        {
            a[i] = new float4(1, 1, 1, 1);
            b[i] = new float4(2, 2, 2, 2);
        }
        RegularJob regularJob = new RegularJob { A = a, B = b, C = c };
        BurstJob burstJob = new BurstJob { A = a, B = b, C = c };
 
        // First run is a warm-up. Second run generates the report.
        long regularTime = 0;
        long burstTime = 0;
        Stopwatch sw = new Stopwatch();
        for (int i = 0; i < 2; ++i)
        {
            sw.Restart();
            regularJob.Schedule().Complete();
            regularTime = sw.ElapsedTicks;
 
            sw.Restart();
            burstJob.Schedule().Complete();
            burstTime = sw.ElapsedTicks;
        }
 
        print(
            "Job,Time\n" +
            "Regular," + regularTime + "\n" +
            "Burst," + burstTime);
 
        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}

Now let’s run this code to see how each compiler performed. I ran using this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.14.4
  • Unity 2019.1.0f2
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

Here are the results I got:

Job      Time (Stopwatch ticks)
Regular  43940
Burst    28590

Burst Compiler Performance
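Raw tick counts depend on the resolution of Stopwatch on the test machine, so a relative comparison is often easier to read. The following is a small sketch of one way to do that in plain C#; TimingMath and its method names are made up for this example and are not part of Unity or the test script.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

static class TimingMath // hypothetical helper, not part of Unity
{
    // Converts Stopwatch ticks to milliseconds using the platform's tick frequency
    public static double TicksToMilliseconds(long ticks)
    {
        return ticks * 1000.0 / Stopwatch.Frequency;
    }

    // Percentage of time saved by the faster run relative to the slower one
    public static double PercentLessTime(long slowTicks, long fastTicks)
    {
        return 100.0 * (slowTicks - fastTicks) / slowTicks;
    }

    // How many times faster the faster run is
    public static double Speedup(long slowTicks, long fastTicks)
    {
        return (double)slowTicks / fastTicks;
    }
}

static class Program
{
    static void Main()
    {
        const long regularTicks = 43940; // measured result for RegularJob
        const long burstTicks = 28590;   // measured result for BurstJob
        CultureInfo inv = CultureInfo.InvariantCulture;

        Console.WriteLine("Percent less time: " +
            TimingMath.PercentLessTime(regularTicks, burstTicks).ToString("F1", inv));
        Console.WriteLine("Speedup: " +
            TimingMath.Speedup(regularTicks, burstTicks).ToString("F2", inv) + "x");
        // This value varies with Stopwatch.Frequency on the machine running it
        Console.WriteLine("Regular (ms): " +
            TimingMath.TicksToMilliseconds(regularTicks).ToString("F3", inv));
    }
}
```

With the two measurements above, this works out to about 34.9% less time, or a speedup of about 1.54x. The millisecond figure is only meaningful on the machine where the ticks were measured.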

By just adding the [BurstCompile] attribute, we’ve got a major speedup! The Burst-compiled code takes about 35% less time to run than the IL2CPP-compiled code (28590 ticks versus 43940), which makes it roughly 1.5x as fast. To find out why, let’s use the Burst Inspector to see the code it generated:

  1. Jobs > Burst > Open Inspector...
  2. Click TestScript.BurstJob on the left
  3. Check Enhanced Disassembly on the right
  4. Uncheck Safety Checks on the right
  5. Click Refresh Disassembly on the right

Here is the section for C[i] = math.dot(A[i], B[i]):

movups   xmm0, xmmword ptr [rcx + rdi]
movups   xmm1, xmmword ptr [rdx + rdi]
mulps    xmm1, xmm0
movshdup xmm0, xmm1
addps    xmm1, xmm0
movhlps  xmm0, xmm1
addps    xmm0, xmm1
shufps   xmm0, xmm0, 0
movups   xmmword ptr [rsi + rdi], xmm0

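These are SIMD instructions, which tell the CPU to perform the same operation on multiple values at once. In this case, all four components of each float4 are processed simultaneously: mulps multiplies the four component pairs in one instruction, the two addps instructions sum the products, and shufps copies that sum into all four lanes before it is stored.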
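To make the instruction sequence easier to follow, here is a scalar walk-through in plain C# that mimics each step on four-element arrays. This is only an illustrative sketch of what the disassembly computes, not code that Burst generates or that Unity provides; SimdWalkthrough and DotBroadcast are hypothetical names.

```csharp
using System;

static class SimdWalkthrough // hypothetical name, for illustration only
{
    // Each float[4] stands in for one 128-bit XMM register holding four floats
    public static float[] DotBroadcast(float[] a, float[] b)
    {
        // mulps: multiply all four lanes at once
        float[] p = { a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3] };

        // movshdup: copy lanes 1 and 3 into lanes 0 and 2
        float[] t = { p[1], p[1], p[3], p[3] };

        // addps: lane 0 = p0 + p1, lane 2 = p2 + p3
        float[] s = { p[0] + t[0], p[1] + t[1], p[2] + t[2], p[3] + t[3] };

        // movhlps: move the upper two lanes of s into the lower two lanes of t
        t = new[] { s[2], s[3], t[2], t[3] };

        // addps: lane 0 now holds p0 + p1 + p2 + p3
        float[] r = { t[0] + s[0], t[1] + s[1], t[2] + s[2], t[3] + s[3] };

        // shufps with immediate 0: broadcast lane 0 to every lane
        return new[] { r[0], r[0], r[0], r[0] };
    }
}

static class Program
{
    static void Main()
    {
        // Same values the test script stores in a[i] and b[i]
        float[] result = SimdWalkthrough.DotBroadcast(
            new float[] { 1, 1, 1, 1 },
            new float[] { 2, 2, 2, 2 });
        Console.WriteLine(string.Join(",", result));
    }
}
```

For the test data used here, every lane of the result is 8, which is the dot product of (1, 1, 1, 1) and (2, 2, 2, 2). The whole sum takes two shuffle-and-add steps rather than three sequential scalar additions.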

While Burst requires a change in programming style, primarily avoiding classes, it provides a major performance benefit to code that complies. If the code is already written in that style, Burst offers “free” performance by simply adding the [BurstCompile] attribute to job types!