Today’s article is about a new native collection type: NativeChunkedList<T>. This type is great when you need a dynamically-resizable array that’s fast to add to and doesn’t waste a lot of memory. Read on to see how it’s implemented, see the performance report, and get the source code.

NativeList

NativeList<T> is Unity’s native version of the C# List<T> type: a dynamically-resizable array. It’s part of the com.unity.collections package that comes with Unity’s ECS, which is still in Preview as of Unity 2018.2. This type stores elements in a single native/unmanaged array, just like NativeArray<T> does. That array has a capacity which is equal to or greater than the number of elements stored, known as the length. When an element is added and there is spare capacity, it’s simply added to the end. If there’s no more spare capacity, a new array with twice the capacity is allocated, all the elements are copied over, the old array is freed, and the element is added to the end.

The performance characteristics of this approach are usually quite good. Accessing the list is just as fast as accessing a NativeArray<T> because it is represented exactly the same: just one array. However, growing the list is quite expensive as it requires copying the whole list to a new array. Likewise, the doubling of capacity can result in quite a lot of waste. For example, if an element is added to a full list with 1024 capacity then the new capacity will be 2048 even though room for only 1025 elements is needed.
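To make the growth strategy concrete, here's a minimal sketch of the grow-by-doubling approach described above. This is not the real NativeList<T> code; it uses a managed array so the example stays self-contained:

using System;

// Illustrative sketch of the "double the capacity on overflow" strategy.
public class GrowableList<T>
{
    private T[] array = new T[1]; // backing storage; its size is the capacity
    private int length;           // number of elements actually stored

    public void Add(T element)
    {
        if (length == array.Length)
        {
            // No spare capacity: allocate an array with twice the capacity,
            // copy all the elements over, and let the old array go.
            T[] newArray = new T[array.Length * 2];
            Array.Copy(array, newArray, length);
            array = newArray;
        }

        // Spare capacity is now guaranteed: just append.
        array[length] = element;
        length++;
    }
}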

Chunked List Approach

Today’s approach aims to solve these issues with NativeList<T>. Like how Unity’s ECS stores components, NativeChunkedList<T> stores its elements in “chunks.” Each chunk is simply an array of elements. The list then also stores an array of pointers to the chunks.

Here’s how the memory is laid out:

NativeChunkedList Data Layout

This layout allows us to traverse the list in a reasonably cache-friendly way. We’ll access sequential memory as we traverse a chunk, just like with NativeList<T> and NativeArray<T>. When we get to the end of a chunk, we might miss the CPU cache but then we’ll be accessing sequential memory again as we traverse the new chunk. The chunk pointers are also sequentially laid out in an array, so traversing them is cache-friendly too.
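Here's a rough sketch of what that traversal looks like. The chunks, numChunks, numElementsPerChunk, and length fields are hypothetical stand-ins for the type's real internals, and managed arrays stand in for the native memory:

using System;

// Illustrative chunk-by-chunk traversal of a chunked list of int.
void Traverse(Action<int> process)
{
    int remaining = length;
    for (int chunkIndex = 0; chunkIndex < numChunks; chunkIndex++)
    {
        // The chunk pointers are laid out sequentially in one array...
        int[] chunk = chunks[chunkIndex];

        // ...and so are the elements within each chunk.
        int elementsInChunk = Math.Min(remaining, numElementsPerChunk);
        for (int i = 0; i < elementsInChunk; i++)
        {
            process(chunk[i]);
        }
        remaining -= elementsInChunk;
    }
}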

Random access is also easy and fast. All we need to do is compute the chunk pointer index (i / numElementsPerChunk) and then use it to index into the chunk (i % numElementsPerChunk).
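In code, the indexer boils down to one division and one modulus. A minimal sketch, using the same hypothetical fields as above:

// Illustrative random access: one divide to find the chunk, one modulus
// to find the slot within it.
public int this[int index]
{
    get
    {
        int chunkIndex = index / numElementsPerChunk;   // which chunk
        int indexInChunk = index % numElementsPerChunk; // slot within it
        return chunks[chunkIndex][indexInChunk];
    }
    set
    {
        chunks[index / numElementsPerChunk][index % numElementsPerChunk] = value;
    }
}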

Adding to a NativeChunkedList<T> is similar to NativeList<T>, but with a critical difference. If the last chunk has spare capacity, we just add the element at the end of it. If it doesn’t, we allocate a new chunk and add the element at the beginning of it. If there’s no spare capacity in the array of pointers, we allocate a new array, copy the old array over, and free the old array. This is just like the backing array of NativeList<T>, but much faster because each element is only a pointer and there are relatively few pointers since each one allows for a whole chunk of elements to be stored.
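Here's a minimal sketch of that Add logic, again with hypothetical fields rather than the type's real internals:

// Illustrative Add for a chunked list of int.
public void Add(int element)
{
    // Is every allocated chunk full (or are there no chunks yet)?
    if (length == numChunks * numElementsPerChunk)
    {
        // Grow the (small) array of chunk pointers if it too is full.
        if (numChunks == chunks.Length)
        {
            int[][] newChunks = new int[chunks.Length * 2][];
            Array.Copy(chunks, newChunks, numChunks);
            chunks = newChunks;
        }

        // Allocate a new chunk. No elements are copied.
        chunks[numChunks] = new int[numElementsPerChunk];
        numChunks++;
    }

    // Add the element to the end of the last chunk.
    chunks[numChunks - 1][length % numElementsPerChunk] = element;
    length++;
}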

Several more operations are supported, but their implementations aren’t particularly noteworthy as nearly everything can be implemented just like with any array-backed type like NativeList<T> or even List<T>. In fact, NativeChunkedList<T> has many of the same methods and properties as NativeList<T> and List<T> so that it’s easy to use for those familiar with those types. Here’s the full API:

struct NativeChunkedList<T>
    : IEnumerable<T>
    , IEnumerable
    , IDisposable
    where T : struct
{
    NativeChunkedList(int numElementsPerChunk, int capacity, Allocator allocator);
    void Add(T element);
    void AddRange(NativeArray<T> array, int startIndex, int length);
    void AddRange(T[] array, int startIndex, int length);
    int Capacity { get; set; }
    void Clear();
    void CopyFrom(NativeArray<T> array, int startIndex, int length);
    void CopyFrom(T[] array, int startIndex, int length);
    void Dispose();
    Enumerator GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
    IEnumerator<T> IEnumerable<T>.GetEnumerator();
    void Insert(int index, T element);
    bool IsCreated { get; }
    int Length { get; }
    T this[int index] { get; set; }
    void RemoveAt(int index);
    void RemoveAtSwapBack(int index);
    T[] ToArray(
        T[] array = null,
        int startIndex = 0,
        int destIndex = 0,
        int length = -1);
    NativeArray<T> ToNativeArray(
        NativeArray<T> array = default(NativeArray<T>),
        int startIndex = 0,
        int destIndex = 0,
        int length = -1);
}

Performance Testing

Now let’s pit NativeChunkedList<T> against NativeArray<T>, NativeList<T>, and NativeLinkedList<T> to see how it performs on a variety of operations. These tests were conducted with 10000 int elements in each collection. NativeChunkedList<T> was configured with a chunk size of 1024 bytes, which equals 256 int elements per chunk.

The following operations were tested for each collection that supports them:

  • Add: Starting from an empty collection, append an element to the end
  • Iterate (Single): Iterate over the full collection in an IJob (a sketch of such a job appears after this list)
  • Iterate (ParallelFor): Iterate over the full collection in an IJobParallelFor
  • Insert: Starting from an empty collection, insert elements into the middle
  • Remove: Starting from a full collection, remove elements from the middle
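
As a guess at the shape of the "Iterate (Single)" test, here's an IJob that reads every element via the indexer. The jobs actually used for these measurements aren't shown here; this only illustrates the general form:

using Unity.Collections;
using Unity.Jobs;

// Illustrative iteration job: sum every element of the list.
struct IterateJob : IJob
{
    public NativeChunkedList<int> List;
    public NativeArray<int> Sum; // single-element array holding the result

    public void Execute()
    {
        for (int i = 0; i < List.Length; i++)
        {
            Sum[0] += List[i];
        }
    }
}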

I ran the performance tests in this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.14
  • Unity 2018.2.16f1
  • com.unity.collections package 0.0.9-preview.10
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

Here are the results I got:

Operation  Job Type     NativeArray  NativeList  NativeLinkedList  NativeChunkedList
Add        Single       n/a          1474        3671              906
Iterate    Single       172          189         190               483
Iterate    ParallelFor  204          n/a         173               467
Insert     Single       n/a          n/a         904               1117774
Remove     Single       n/a          n/a         977               1763470

NativeChunkedList - Performance Graph for Add

NativeChunkedList - Performance Graph for Iterate (Single)

NativeChunkedList - Performance Graph for Iterate (ParallelFor)

NativeChunkedList - Performance Graph for Insert

NativeChunkedList - Performance Graph for Remove

We can see here that we've succeeded in our goal to make Add faster. It takes 39% less time on average to add an element to a NativeChunkedList<T> than to a NativeList<T>, and 75% less time than to a NativeLinkedList<T>. This is because, unlike those other types, there are no whole-list copies when capacity runs out. For the 10000 elements in this test, NativeList<T> would have doubled its capacity from 1 to 2, then to 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, and finally 16384. That's a lot of copies!

Iteration, either in an IJob or an IJobParallelFor, takes 2-3x more time than with the other collection types. This is due to the cache misses when going from one chunk to another, the double array indexing required by the chunks array, and the divide and modulus instructions. There's simply no way to beat the other types' single array layout. It should be noted that NativeLinkedList<T> has been pre-sorted so the nodes are sequential. If they were randomized, performance would suffer greatly and NativeChunkedList<T> would win out against it. It should also be noted that NativeChunkedList<T> is usable from an IJobParallelFor where NativeList<T> is not.

The Insert and Remove operations, supported only by NativeLinkedList<T> and NativeChunkedList<T>, show the usual weakness of array-based data structures. They require shifting all elements after the insertion or removal point, which is hugely more expensive than the linked list approach of simply fixing up the Next and Previous pointers. Also of note is that NativeList<T> and NativeChunkedList<T> feature a RemoveAtSwapBack method, which avoids all the element shifting by copying the last element to the removal point. This changes the order of the elements, but if order isn't important then this is far faster.
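As a sketch of the swap-back idea (not the type's actual implementation; length is a hypothetical field standing in for its real internals):

// Illustrative RemoveAtSwapBack: overwrite the removed element with the
// last element instead of shifting everything after it. This is O(1)
// rather than O(N), at the cost of reordering the elements.
public void RemoveAtSwapBack(int index)
{
    this[index] = this[length - 1];
    length--;
}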

Memory Consumption

Our other goal is to reduce the memory wasted by excess capacity in NativeList<T>. The final capacity of its array in the above test is 16384 which, at four bytes per int element, is 65536 bytes of memory usage. Each chunk of the NativeChunkedList<T> was 1024 bytes and held 256 elements, so there were 40 chunks when the list was full. That's 40960 bytes of memory usage for the chunks. The chunk pointers array was also present with a capacity of 64, the internal minimum. That's 256 bytes on 32-bit platforms and 512 bytes on 64-bit platforms. Let's assume 64-bit, so the total usage of the NativeChunkedList<T> was 41472 bytes. That means it saved 24064 bytes, or 37%. For larger-sized elements such as Quaternion, the gap would grow 4x wider.

In this example, the NativeChunkedList<T> was 98% full while the NativeList<T> was only 61% full. This is despite the last chunk being only 6% full, which shows how even in the worst-case scenario memory waste is kept under control. On the other hand, NativeArray<T> is sized perfectly for the data since it doesn't dynamically resize. It used 100% of its 10000 elements for a total of 40000 bytes. Compared to it, NativeChunkedList<T> used 1472 more bytes for an increase of 4%. Compared to this best-case scenario, NativeChunkedList<T> turns in a great result.

Usage

Usage of NativeChunkedList<T> is simple. It's just like using a NativeArray<T>, NativeList<T>, or NativeLinkedList<T>. Here are a few quick examples:

// Create an empty list with capacity for 4096 elements in 1024 element chunks
NativeChunkedList<int> list = new NativeChunkedList<int>(
    1024,
    4096,
    Allocator.Temp);
 
// Add some elements to the list
list.Add(10);
list.Add(20);
list.Add(30);
list.Add(40);
list.Add(50);
 
// Iterate over the list
foreach (int val in list)
{
    Debug.Log(val);
}
 
// Remove an element from the middle
list.RemoveAt(2);
 
// Insert an element into the middle
list.Insert(1, 15);
 
// Access the elements sequentially
for (int i = 0; i < list.Length; ++i)
{
    Debug.Log(list[i]);
}
 
// Dispose the list's native memory
list.Dispose();

Conclusion

With NativeChunkedList<T>, we've achieved the two goals we set forth at the outset. We've got faster calls to Add with a 39% time reduction. We've also got a 37% reduction in memory usage that's now only 4% over the theoretical lowest value. While not explicit goals, we even got a pretty good suite of functionality including support for IJobParallelFor. To use the type in your own projects, read the source code, or run the performance and unit tests, check out the NativeCollection GitHub project.

Continue to part two.