Today’s article is about a new native collection type: NativeChunkedList<T>. This type is great when you need a dynamically-resizable array that’s fast to add to and doesn’t waste a lot of memory. Read on to see how it’s implemented, see the performance report, and get the source code.

NativeList

NativeList<T> is Unity’s native version of the C# List<T> type: a dynamically-resizable array. It’s part of the com.unity.collections package that comes with Unity’s ECS, which is still in Preview as of Unity 2018.2. This type stores elements in a single native/unmanaged array, just like NativeArray<T> does. That array has a capacity which is equal to or greater than the number of elements stored, known as the length. When an element is added and there is spare capacity, it’s simply added to the end. If there’s no more spare capacity, a new array with twice the capacity is allocated, all the elements are copied over, the old array is freed, and the element is added to the end.

The performance characteristics of this approach are usually quite good. Accessing the list is just as fast as accessing a NativeArray<T> because it is represented exactly the same: just one array. However, growing the list is quite expensive as it requires copying the whole list to a new array. Likewise, the doubling of capacity can result in quite a lot of waste. For example, if an element is added to a full list with 1024 capacity then the new capacity will be 2048 even though room for only 1025 elements is needed.
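To make the growth strategy concrete, here's a minimal sketch of the grow-by-doubling approach described above. This is not the real NativeList<T> code; it uses a managed array so the example stays self-contained:

using System;

// Illustrative sketch of the "double the capacity on overflow" strategy.
public class GrowableList<T>
{
    private T[] array = new T[1]; // backing storage; its size is the capacity
    private int length;           // number of elements actually stored

    public void Add(T element)
    {
        if (length == array.Length)
        {
            // No spare capacity: allocate an array with twice the capacity,
            // copy all the elements over, and let the old array go.
            T[] newArray = new T[array.Length * 2];
            Array.Copy(array, newArray, length);
            array = newArray;
        }

        // Spare capacity is now guaranteed: just append.
        array[length] = element;
        length++;
    }
}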

Chunked List Approach

Today’s approach aims to solve these issues with NativeList<T>. Like how Unity’s ECS stores components, NativeChunkedList<T> stores its elements in “chunks.” Each chunk is simply an array of elements. The list then also stores an array of pointers to the chunks.

Here’s how the memory is laid out:

NativeChunkedList Data Layout

This layout allows us to traverse the list in a reasonably cache-friendly way. We’ll access sequential memory as we traverse a chunk, just like with NativeList<T> and NativeArray<T>. When we get to the end of a chunk, we might miss the CPU cache but then we’ll be accessing sequential memory again as we traverse the new chunk. The chunk pointers are also sequentially laid out in an array, so traversing them is cache-friendly too.
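Here's a rough sketch of what that traversal looks like. The chunks, numChunks, numElementsPerChunk, and length fields are hypothetical stand-ins for the type's real internals, and managed arrays stand in for the native memory:

using System;

// Illustrative chunk-by-chunk traversal of a chunked list of int.
void Traverse(Action<int> process)
{
    int remaining = length;
    for (int chunkIndex = 0; chunkIndex < numChunks; chunkIndex++)
    {
        // The chunk pointers are laid out sequentially in one array...
        int[] chunk = chunks[chunkIndex];

        // ...and so are the elements within each chunk.
        int elementsInChunk = Math.Min(remaining, numElementsPerChunk);
        for (int i = 0; i < elementsInChunk; i++)
        {
            process(chunk[i]);
        }
        remaining -= elementsInChunk;
    }
}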

Random access is also easy and fast. All we need to do is compute the chunk pointer index (i / numElementsPerChunk) and then use it to index into the chunk (i % numElementsPerChunk).
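In code, the indexer boils down to one division and one modulus. A minimal sketch, using the same hypothetical fields as above:

// Illustrative random access: one divide to find the chunk, one modulus
// to find the slot within it.
public int this[int index]
{
    get
    {
        int chunkIndex = index / numElementsPerChunk;   // which chunk
        int indexInChunk = index % numElementsPerChunk; // slot within it
        return chunks[chunkIndex][indexInChunk];
    }
    set
    {
        chunks[index / numElementsPerChunk][index % numElementsPerChunk] = value;
    }
}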

Adding to a NativeChunkedList<T> is similar to NativeList<T>, but with a critical difference. If the last chunk has spare capacity, we just add the element at the end of it. If it doesn’t, we allocate a new chunk and add the element at the beginning of it. If there’s no spare capacity in the array of pointers, we allocate a new array, copy the old array over, and free the old array. This is just like the backing array of NativeList<T>, but much faster because each element is only a pointer and there are relatively few pointers since each one allows for a whole chunk of elements to be stored.
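Here's a minimal sketch of that Add logic, again with hypothetical fields rather than the type's real internals:

// Illustrative Add for a chunked list of int.
public void Add(int element)
{
    // Is every allocated chunk full (or are there no chunks yet)?
    if (length == numChunks * numElementsPerChunk)
    {
        // Grow the (small) array of chunk pointers if it too is full.
        if (numChunks == chunks.Length)
        {
            int[][] newChunks = new int[chunks.Length * 2][];
            Array.Copy(chunks, newChunks, numChunks);
            chunks = newChunks;
        }

        // Allocate a new chunk. No elements are copied.
        chunks[numChunks] = new int[numElementsPerChunk];
        numChunks++;
    }

    // Add the element to the end of the last chunk.
    chunks[numChunks - 1][length % numElementsPerChunk] = element;
    length++;
}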

Several more operations are supported, but their implementations aren’t particularly noteworthy as nearly everything can be implemented just like with any array-backed type like NativeList<T> or even List<T>. In fact, NativeChunkedList<T> has many of the same methods and properties as NativeList<T> and List<T> so that it’s easy to use for those familiar with those types. Here’s the full API:

struct NativeChunkedList<T>
    : IEnumerable<T>
    , IEnumerable
    , IDisposable
    where T : struct
{
    NativeChunkedList(int numElementsPerChunk, int capacity, Allocator allocator);
    void Add(T element);
    void AddRange(NativeArray<T> array, int startIndex, int length);
    void AddRange(T[] array, int startIndex, int length);
    int Capacity { get; set; }
    void Clear();
    void CopyFrom(NativeArray<T> array, int startIndex, int length);
    void CopyFrom(T[] array, int startIndex, int length);
    void Dispose();
    Enumerator GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
    IEnumerator<T> IEnumerable<T>.GetEnumerator();
    void Insert(int index, T element);
    bool IsCreated { get; }
    int Length { get; }
    T this[int index] { get; set; }
    void RemoveAt(int index);
    void RemoveAtSwapBack(int index);
    T[] ToArray(
        T[] array = null,
        int startIndex = 0,
        int destIndex = 0,
        int length = -1);
    NativeArray<T> ToNativeArray(
        NativeArray<T> array = default(NativeArray<T>),
        int startIndex = 0,
        int destIndex = 0,
        int length = -1);
}

Performance Testing

Now let’s pit NativeChunkedList<T> against NativeArray<T>, NativeList<T>, and NativeLinkedList<T> to see how it performs on a variety of operations. These tests were conducted with 10000 int elements in each collection. NativeChunkedList<T> was configured with a chunk size of 1024 bytes, which equals 256 int elements per chunk.

The following operations were tested for each collection that supports them:

  • Add: Starting from an empty collection, append an element to the end
  • Iterate (Single): Iterate over the full collection in an IJob (a sketch of such a job appears after this list)
  • Iterate (ParallelFor): Iterate over the full collection in an IJobParallelFor
  • Insert: Starting from an empty collection, insert elements into the middle
  • Remove: Starting from a full collection, remove elements from the middle
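
As a guess at the shape of the "Iterate (Single)" test, here's an IJob that reads every element via the indexer. The jobs actually used for these measurements aren't shown here; this only illustrates the general form:

using Unity.Collections;
using Unity.Jobs;

// Illustrative iteration job: sum every element of the list.
struct IterateJob : IJob
{
    public NativeChunkedList<int> List;
    public NativeArray<int> Sum; // single-element array holding the result

    public void Execute()
    {
        for (int i = 0; i < List.Length; i++)
        {
            Sum[0] += List[i];
        }
    }
}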

I ran the performance tests in this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.14
  • Unity 2018.2.16f1
  • com.unity.collections package 0.0.9-preview.10
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

Here are the results I got:

Operation  Job Type     NativeArray  NativeList  NativeLinkedList  NativeChunkedList
Add        Single       n/a          1474        3671              906
Iterate    Single       172          189         190               483
Iterate    ParallelFor  204          n/a         173               467
Insert     Single       n/a          n/a         904               1117774
Remove     Single       n/a          n/a         977               1763470

NativeChunkedList - Performance Graph for Add

NativeChunkedList - Performance Graph for Iterate (Single)

NativeChunkedList - Performance Graph for Iterate (ParallelFor)

NativeChunkedList - Performance Graph for Insert

NativeChunkedList - Performance Graph for Remove

We can see here that we've succeeded in our goal to make Add faster. It takes 39% less time on average to add an element to a NativeChunkedList<T> than to a NativeList<T>, and 75% less time than to a NativeLinkedList<T>. This is because, unlike those other types, there are no whole-list copies when capacity runs out. For the 10000 elements in this test, NativeList<T> would have doubled its capacity from 1 to 2, then to 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, and finally 16384. That's a lot of copies!

Iteration, either in an IJob or an IJobParallelFor, takes 2-3x more time than with the other collection types. This is due to the cache misses when going from one chunk to another, the double array indexing required by the chunks array, and the divide and modulus instructions. There's simply no way to beat the other types' single array layout. It should be noted that NativeLinkedList<T> has been pre-sorted so the nodes are sequential. If they were randomized, performance would suffer greatly and NativeChunkedList<T> would win out against it. It should also be noted that NativeChunkedList<T> is usable from an IJobParallelFor where NativeList<T> is not.

The Insert and Remove operations, supported only by NativeLinkedList<T> and NativeChunkedList<T>, show the usual weakness of array-based data structures. They require shifting all elements after the insertion or removal point, which is hugely more expensive than the linked list approach of simply fixing up the Next and Previous pointers. Also of note is that NativeList<T> and NativeChunkedList<T> feature a RemoveAtSwapBack method, which avoids all the element shifting by copying the last element to the removal point. This changes the order of the elements, but if order isn't important then this is far faster.
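As a sketch of the swap-back idea (not the type's actual implementation; length is a hypothetical field standing in for its real internals):

// Illustrative RemoveAtSwapBack: overwrite the removed element with the
// last element instead of shifting everything after it. This is O(1)
// rather than O(N), at the cost of reordering the elements.
public void RemoveAtSwapBack(int index)
{
    this[index] = this[length - 1];
    length--;
}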

Memory Consumption

Our other goal is to reduce the memory wasted by excess capacity in NativeList<T>. The final capacity of its array in the above test is 16384 which, at four bytes per int element, is 65536 bytes of memory usage. Each chunk of the NativeChunkedList<T> was 1024 bytes and held 256 elements, so there were 40 chunks when the list was full. That's 40960 bytes of memory usage for the chunks. The chunk pointers array was also present with a capacity of 64, the internal minimum. That's 256 bytes on 32-bit platforms and 512 bytes on 64-bit platforms. Let's assume 64-bit, so the total usage of the NativeChunkedList<T> was 41472 bytes. That means it saved 24064 bytes, or 37%. For larger-sized elements such as Quaternion, the gap would grow 4x wider.

In this example, the NativeChunkedList<T> was 98% full while the NativeList<T> was only 61% full. This is despite the last chunk being only 6% full, which shows how even in the worst-case scenario memory waste is kept under control. On the other hand, NativeArray<T> is sized perfectly for the data since it doesn't dynamically resize. It used 100% of its 10000 elements for a total of 40000 bytes. Compared to it, NativeChunkedList<T> used 1472 more bytes for an increase of 4%. Compared to this best-case scenario, NativeChunkedList<T> turns in a great result.

Usage

Usage of NativeChunkedList<T> is simple. It's just like using a NativeArray<T>, NativeList<T>, or NativeLinkedList<T>. Here are a few quick examples:

// Create an empty list with capacity for 4096 elements in 1024 element chunks
NativeChunkedList<int> list = new NativeChunkedList<int>(
    1024,
    4096,
    Allocator.Temp);
 
// Add some elements to the list
list.Add(10);
list.Add(20);
list.Add(30);
list.Add(40);
list.Add(50);
 
// Iterate over the list
foreach (int val in list)
{
    Debug.Log(val);
}
 
// Remove an element from the middle
list.RemoveAt(2);
 
// Insert an element into the middle
list.Insert(1, 15);
 
// Access the elements sequentially
for (int i = 0; i < list.Length; ++i)
{
    Debug.Log(list[i]);
}
 
// Dispose the list's native memory
list.Dispose();

Conclusion

With NativeChunkedList<T>, we've achieved the two goals we set forth at the outset. We've got faster calls to Add with a 39% time reduction. We've also got a 37% reduction in memory usage that's now only 4% over the theoretical lowest value. While not explicit goals, we even got a pretty good suite of functionality including support for IJobParallelFor. To use the type in your own projects, read the source code, or run the performance and unit tests, check out the NativeCollection GitHub project.

Continue to part two.