JacksonDunstan.com

Unity 2019.1 was released last week and the Burst compiler is now out of Preview. It promises superior performance by generating more optimal code than with IL2CPP. Let’s try it out and see if the performance lives up to the hype!

To use the Burst compiler, we need to do three main things. First, we must restrict our usage of C# to the “High-Performance C#” subset. This means we don’t use any managed objects such as classes and strings. Second, we must put all our code in a C# job such as one implementing IJob or IJobParallelFor. Third, we must add the [BurstCompile] attribute to the job. If we already did the first two steps, adding the [BurstCompile] attribute is trivial enough to be considered “free” in terms of cost to implement.

So let’s write a simple job and test its performance with and without the [BurstCompile] attribute. All this job does is perform a dot product on two arrays of vectors. We’ll use the NativeArray<T> type to hold the array and the float4 type from the newly-released Unity.Mathematics package to hold the vectors. Burst is aware of the float4 type and can generate more optimal code when we use it. Here’s the test:

using System.Diagnostics;
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;
 
class TestScript : MonoBehaviour
{
    struct RegularJob : IJob
    {
        public NativeArray<float4> A;
        public NativeArray<float4> B;
        public NativeArray<float4> C;
 
        public void Execute()
        {
            for (int i = 0; i < A.Length; ++i)
            {
                C[i] = math.dot(A[i], B[i]);
            }
        }
    }
 
    [BurstCompile]
    struct BurstJob : IJob
    {
        public NativeArray<float4> A;
        public NativeArray<float4> B;
        public NativeArray<float4> C;
 
        public void Execute()
        {
            for (int i = 0; i < A.Length; ++i)
            {
                C[i] = math.dot(A[i], B[i]);
            }
        }
    }
 
    void Start()
    {
        const int size = 1000000;
        NativeArray<float4> a = new NativeArray<float4>(size, Allocator.TempJob);
        NativeArray<float4> b = new NativeArray<float4>(size, Allocator.TempJob);
        NativeArray<float4> c = new NativeArray<float4>(size, Allocator.TempJob);
        for (int i = 0; i < size; ++i)
        {
            a[i] = new float4(1, 1, 1, 1);
            b[i] = new float4(2, 2, 2, 2);
        }
        RegularJob regularJob = new RegularJob { A = a, B = b, C = c };
        BurstJob burstJob = new BurstJob { A = a, B = b, C = c };
 
        // First run is a warm-up. Second run generates the report.
        long regularTime = 0;
        long burstTime = 0;
        Stopwatch sw = new Stopwatch();
        for (int i = 0; i < 2; ++i)
        {
            sw.Restart();
            regularJob.Schedule().Complete();
            regularTime = sw.ElapsedTicks;
 
            sw.Restart();
            burstJob.Schedule().Complete();
            burstTime = sw.ElapsedTicks;
        }
 
        print(
            "Job,Time\n" +
            "Regular," + regularTime + "\n" +
            "Burst," + burstTime);
 
        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}

Now letâ€™s run this code to see how each job compiler performed. I ran using this environment:

2.7 Ghz Intel Core i7-6820HQ
macOS 10.14.4
Unity 2019.1.0f2
macOS Standalone
.NET 4.x scripting runtime version and API compatibility level
IL2CPP
Non-development
640Ã—480, Fastest, Windowed

Here are the results I got:

Job	Time
Regular	43940
Burst	28590

Burst Compiler Performance

By just adding the [BurstCompile] attribute, we’ve got a major speedup! The Burst-compiled code takes 35% less time to run than the IL2CPP-compiled code. To find out why, let’s use the Burst Inspector to see the code it generated:

Jobs > Burst > Open Inspector...


Click TestScript.BurstJob on the left
Check Enhanced Disassembly on the right
Uncheck Safety Checks on the right
Click Refresh Disassembly on the right


Here is the section for C[i] = math.dot(A[i], B[i]):


movups   xmm0, xmmword ptr [rcx + rdi]
movups   xmm1, xmmword ptr [rdx + rdi]
mulps    xmm1, xmm0
movshdup xmm0, xmm1
addps    xmm1, xmm0
movhlps  xmm0, xmm1
addps    xmm0, xmm1
shufps   xmm0, xmm0, 0
movups   xmmword ptr [rsi + rdi], xmm0


These are SIMD instructions, which tell the CPU to do the same operation on many variables at once. In this case, all four components of the float4 are operated on simulateneously to add them, multiply them, etc.
While Burst requires a change in programming style, primarily to not use classes, it provides a major performance benefit to code that complies. If the code was already written in that style, Burst offers "free" performance by simply adding the [BurstCompile] attribute to job types!

#1 by MechEthan on April 22nd, 2019 · Reply

Would you mind explaining, in brief, how the warmup works?

I assume, just based on the term, that it results in more consistent benchmark results, but I don’t understand why.

#2 by jackson on April 22nd, 2019 · Reply

That’s right. The purpose is to avoid counting any work that’s only done on the first usage. In this case the job system and Burst may do some work the first time they execute a job. The test throws away the measurements for the first job execution and only reports the measurements from the second time the job is run.
- #3 by lazalong on October 22nd, 2019 · Reply
  
  Knowing the timing without the warm-up is good in case we can put this phase in an initialisation. But how costly is it if we can’t do this and have the warm-up phase ‘each time’ ?
  
  Shouldn’t we have measures with and without warm-up cost for a true comparison?
  - #4 by jackson on October 24th, 2019 · Reply
    
    I don’t expect that many games will care about the warmup cost since it only occurs on the first run and is relatively small, but if you want to measure it just replace i < 2 with i < 1 in the test and run it on your target hardware.

#5 by X on April 22nd, 2019 · Reply

Does the Burst dependency come from UPM? Can it be used by code built into a managed .NET DLL?

#6 by jackson on April 22nd, 2019 · Reply

Yes, Burst 1.0.0 is now available in the Unity Package Manager. It supports code in a .NET DLL just like code directly in the project because it operates on IL, not C# source code.
- #7 by X on April 23rd, 2019 · Reply
  
  Given that it works on IL, it does seem possible to use the Burst-related attributes and classes in a .NET library project, but the part that I don’t see a solution for is how a VS project can depend on Burst since it’s a dependency from UPM rather than types coming from UnityEngine.dll…
  
  I suppose this is a general problem with UPM dependencies, but as someone who builds their code into a reusable DLL rather than dropping it all into Unity, it would be awesome to have a clean way to use these Burst types in a managed assembly.
  - #8 by jackson on April 24th, 2019 · Reply
    
    The Burst package includes an Assembly Definition file so it’s compiled into its own DLL. You can find it here:
    
    /path/to/unity/project/Library/ScriptAssemblies/Unity.Burst.dll
    
    You just need to reference that in your DLL project and you should be good to go.

#9 by Jesse TG on May 8th, 2019 · Reply

Thanks for the article (and for your others). Does Burst support jobs of type `IJobParallelFor`?

#10 by jackson on May 8th, 2019 · Reply

Yes.
- #11 by Jesse TG on May 9th, 2019 · Reply
  
  Sweet, thank you.

#12 by Jesse TG on May 9th, 2019 · Reply

New question. Inside a `IJobParallelFor` job, each iteration (or at least each thread) needs a data structure to keep track of intermediate results. What would you suggest?

#13 by jackson on May 9th, 2019 · Reply

That entirely depends on the data you’re looking to store and use. NativeArray<T> is the most common type, but there are many more in the Unity.Collections package as well as my own NativeCollections repo.
- #14 by Jesse TG on May 14th, 2019 · Reply
  
  I wound up just using a local stackalloc array of bools, but thank you.

Free Performance with Burst

Comments