With a bit of understanding and some C# trickery, we can exploit how float works to cram in a few more bits and make some big performance gains. Today we’ll see how to steal some of the bits from a float!

The IEEE 754 standard describes how float values are stored. Here’s how it looks:

  • 1 bit: Sign
  • 8 bits: Exponent
  • 23 bits: Fraction

The Wikipedia page goes way more detail, if you’re interested. For now, we’ll use this knowledge to exploit the format and steal some bits.

Here’s the idea. There are a lot of fraction bits: 23. How often do we really need all that precision? For example, if we’re just storing an angle, do we need precision to 0.00000011920929 degrees? Usually not. So what if we stored something else in the least-significant fraction bits?

We can do this in C# pretty easily using a union. We can even measure the impact it’ll have on our precision. To do so, we’ll start with the float for 1.0:

Sign Exponent Fraction
0 01111111 00000000000000000000000

Then we’ll change more and most of the least-significant digits to 1:

Sign Exponent Fraction
0 01111111 00000000000000000000001
0 01111111 00000000000000000000011
0 01111111 00000000000000000000111
0 01111111 00000000000000000001111

To see the result, here’s a tiny script that changes the bits starting at none and gradually changing them all:

using System;
using System.Runtime.InteropServices;
using System.Text;
using UnityEngine;
 
[StructLayout(LayoutKind.Explicit)]
struct UintFloatUnion
{
    [FieldOffset(0)] public uint Uint;
    [FieldOffset(0)] public float Float;
}
 
class TestScript : MonoBehaviour
{
    void Start()
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= 23; ++i)
        {
            uint u = 0b00111111100000000000000000000000; // == 1.0
            u |= (1u << i) - 1; // Change the least significant 'i' bits to 1
 
            // Reinterpret as float
            UintFloatUnion ufu;
            ufu.Float = 0;
            ufu.Uint = u;
            float f = ufu.Float;
 
            if (i < 10)
            {
                sb.Append(' ');
            }
            sb.Append(i);
            sb.Append(": ");
            sb.Append(f.ToString("0." + new string('0', 30)));
            sb.Append('n');
        }
        print(sb.ToString());
    }
}

When we run this, we get these results:

 0: 1.000000000000000000000000000000
 1: 1.000000000000000000000000000000
 2: 1.000000000000000000000000000000
 3: 1.000001000000000000000000000000
 4: 1.000002000000000000000000000000
 5: 1.000004000000000000000000000000
 6: 1.000008000000000000000000000000
 7: 1.000015000000000000000000000000
 8: 1.000030000000000000000000000000
 9: 1.000061000000000000000000000000
10: 1.000122000000000000000000000000
11: 1.000244000000000000000000000000
12: 1.000488000000000000000000000000
13: 1.000976000000000000000000000000
14: 1.001953000000000000000000000000
15: 1.003906000000000000000000000000
16: 1.007812000000000000000000000000
17: 1.015625000000000000000000000000
18: 1.031250000000000000000000000000
19: 1.062500000000000000000000000000
20: 1.125000000000000000000000000000
21: 1.250000000000000000000000000000
22: 1.500000000000000000000000000000
23: 2.000000000000000000000000000000

Unfortunately, float.ToString isn’t giving us enough precision to gauge the impact. Most of the digits are simply 0. So let’s re-write this script in a tiny C++ command line program:

#include <iostream>
#include <iomanip>
 
int main()
{
    std::cout.precision(30);
    for (unsigned int i = 0; i <= 23; ++i)
    {
        unsigned int u = 0b00111111100000000000000000000000; // == 1.0
        u |= (1u << i) - 1; // Change the least significant 'i' bits to 1
        float f = *(float*)(&u); // Reinterpret as float
        std::cout << std::setw(2) << i << ": " << std::fixed << f << 'n';
    }
}

Now here’s what we get:

 0: 1.000000000000000000000000000000
 1: 1.000000119209289550781250000000
 2: 1.000000357627868652343750000000
 3: 1.000000834465026855468750000000
 4: 1.000001788139343261718750000000
 5: 1.000003695487976074218750000000
 6: 1.000007510185241699218750000000
 7: 1.000015139579772949218750000000
 8: 1.000030398368835449218750000000
 9: 1.000060915946960449218750000000
10: 1.000121951103210449218750000000
11: 1.000244021415710449218750000000
12: 1.000488162040710449218750000000
13: 1.000976443290710449218750000000
14: 1.001953005790710449218750000000
15: 1.003906130790710449218750000000
16: 1.007812380790710449218750000000
17: 1.015624880790710449218750000000
18: 1.031249880790710449218750000000
19: 1.062499880790710449218750000000
20: 1.124999880790710449218750000000
21: 1.249999880790710449218750000000
22: 1.499999880790710449218750000000
23: 1.999999880790710449218750000000

That’s a lot more precision! We can now see the impact of using more and more of the digits.

Exactly how many digits to use is really dependent on the data. If the float is just going to translate to a percentage with a couple of decimal points, it’s possible to steal about 8-9 bits. Nearly all float values can tolerate stealing 1-3 bits, which is good enough to store a boolean value or a small integer. However, in some cases there’s a need for incredible accuracy and this technique isn’t useful at all.

There’s no way we’re going to do tons of bit twiddling just to use float values, so we’ll need some utility functions to do it for us. Here are the three we’re going to crete:

  • ReadStolen: Read the integer value stored in the least-signficant bits
  • ReadFloatCleared: Read the float with the value stored in the least-signficant bits cleared to zero
  • Write Write a new float value without changing the value stored in the least-signficant bits

Also, note that the float can be used directly for maximum performance but any 1 bits set in the least-signficant bits will increase its value.

Here’s how to use these functions:

// Read the integer value stored in the least-signficant 3 bits
float myVal = ...;
uint readVal = FloatBitStealing.ReadStolen(myVal, 3);
 
// Read the float value with the least-signficant 3 bits set to zero
float myVal = ...;
float myValCleared = FloatBitStealing.ReadFloatCleared(myVal, 3);
 
// Write srcVal to destVal without changing the least-significant 3 bits
float srcVal = ...;
float destVal = ...;
float writtenVal = FloatBitStealing.Write(srcVal, destVal, 3);

Here’s the source code for FloatBitStealing:

/// <summary>
/// Utility functions for working with <see cref="float"/> values that have some
/// of their least-signficant bits "stolen" by storing another value there.
/// </summary>
///
/// <author>
/// Jackson Dunstan, https://JacksonDunstan.com/articles/5365
/// </author>
///
/// <license>
/// MIT
/// </license>
public static class FloatBitStealing
{
    /// <summary>
    /// Utility to change the type of a <see cref="float"/> to a
    /// <see cref="uint"/> or visa versa.
    /// </summary>
    [System.Runtime.InteropServices.StructLayout(
        System.Runtime.InteropServices.LayoutKind.Explicit)]
    private struct UintFloatUnion
    {
        /// <summary>
        /// <see cref="uint"/> value of the memory
        /// </summary>
        [System.Runtime.InteropServices.FieldOffset(0)] public uint Uint;
 
        /// <summary>
        /// <see cref="float"/> value of the memory
        /// </summary>
        [System.Runtime.InteropServices.FieldOffset(0)] public float Float;
    }
 
    /// <summary>
    /// Read the value with the least-signficant bits cleared to zero
    /// </summary>
    /// 
    /// <param name="f">
    /// Value to read from
    /// </param>
    /// 
    /// <param name="numStolenBits">
    /// Number of least-signficant bits to clear to zero
    /// </param>
    /// 
    /// <returns>
    /// The given value with the given number of least-signficant bits cleared
    /// to zero
    /// </returns>
    public static float ReadFloatCleared(float f, int numStolenBits)
    {
        // 1s for all stolen bits
        uint stolenMask = (1u << numStolenBits) - 1;
 
        UintFloatUnion ufu;
        ufu.Uint = 0;
        ufu.Float = f;
        ufu.Uint = ufu.Uint & ~stolenMask;
        return ufu.Float;
    }
 
    /// <summary>
    /// Read the least-signficant bits
    /// </summary>
    /// 
    /// <param name="f">
    /// Value to read from
    /// </param>
    /// 
    /// <param name="numStolenBits">
    /// Number of least-signficant bits to read
    /// </param>
    /// 
    /// <returns>
    /// The given number of least-signficant bits of the given value
    /// </returns>
    public static uint ReadStolen(float f, int numStolenBits)
    {
        // 1s for all stolen bits
        uint stolenMask = (1u << numStolenBits) - 1;
 
        UintFloatUnion ufu;
        ufu.Uint = 0;
        ufu.Float = f;
        return ufu.Uint & stolenMask;
    }
 
    /// <summary>
    /// Write a value to an existing value without overwriting its
    /// least-signficant bits
    /// </summary>
    /// 
    /// <param name="src">
    /// Value to write
    /// </param>
    /// 
    /// <param name="dest">
    /// Value to write to
    /// </param>
    /// 
    /// <param name="numStolenBits">
    /// Number of least-signficant bits to not overwrite
    /// </param>
    /// 
    /// <returns>
    /// The most-signficant <paramref name="numStolenBits"/> bits of the given
    /// <paramref name="src"/> value combined with the least-signficant bits of
    /// the given <paramref name="dest"/> value.
    /// </returns>
    public static float Write(float src, float dest, int numStolenBits)
    {
        // 1s for all stolen bits
        uint stolenMask = (1u << numStolenBits) - 1;
 
        // Get the not-stolen bits of 'src', which are the most-signficant bits
        UintFloatUnion ufu;
        ufu.Uint = 0;
        ufu.Float = src;
        uint srcNotStolen = ufu.Uint & ~stolenMask;
 
        // Get the stolen bits of 'dest', which are the least-significant bits
        ufu.Uint = 0;
        ufu.Float = dest;
        uint destStolen = ufu.Uint & stolenMask;
 
        // Combine them
        ufu.Uint = srcNotStolen | destStolen;
        return ufu.Float;
    }
}

This technique and the above utility functions are particularly useful to fit the most float values in the CPU’s data cache. For example, a struct with a float and a bool takes up 8 bytes and so only 8 of them will fit in a 64 byte cache line. Alternatively, stealing just one bit of the float with this technique means that there’s no need for the bool and suddenly twice as many values will fit in a cache line. So keep this technique in mind next time you’re trying to fit more data in a cache line and have a float that doesn’t necessarily need all of its precision.