Make Math.sqrt Twice As Fast
Today’s article shows you how to double the speed of Math.sqrt
, which should be useful to many Flash apps as it is a very common operation. For example, the difference between two points involves a square root: sqrt(dx*dx+dy*dy)
. Read on to learn how!
For such an important function, I’m surprised there isn’t much out there discussing how to improve its speed. The most informative article on the topic is found at boostworthy.com and lists two alternative functions:
- The “Babylonian” approach: a loop that gets closer and closer to the actual square root
- The “Bakhshali” approach: a lookup table where the keys are the squares of the values
The “Babylonian” approach is very accurate, but the “Bakhshali” approach only works on perfect squares (e.g. bakhshali(25)==5
but bakhshali(10)==NaN
) due to a divide by zero.
In addition to these approaches, I have added a lookup table-based approach based on a modified version of TrigLUT. This LUT
class still supports arbitrary math functions, but has a couple of changes from the TrigLUT
class:
- Uses a
max
parameter instead of2π
to support square roots up tomax
- Removes all cyclical functions, since
Math.sqrt
is not cyclical
Here is the source code for the LUT
class and a performance test to compare the various approaches:
package { /** * A generic lookup table useful for caching the results of a function * @author Jackson Dunstan */ public class LUT { /** Table of function values*/ public var table:Vector.<Number>; /** 10^decimals of precision*/ public var pow:Number; /** * Make the look up table * @param max Maximum value to cache * @param numDigits Number of digits places of precision * @param func Function to call to generate stored values. * Must be valid on [0,max). * @throws Error If func is null or invalid on [0,max) */ public function LUT(numDigits:uint, max:Number, func:Function) { var pow:Number = this.pow = Math.pow(10, numDigits); var round:Number = 1.0 / pow; var len:uint = 1 + max*pow; var table:Vector.<Number> = this.table = new Vector.<Number>(len); var val:Number = 0; for (var i:uint = 0; i < len; ++i) { table[i] = func(val); val += round; } } /** * Look up the value of the given input * @param val Input value to look up the value of * @return The value of the given input */ public function val(val:Number): Number { return this.table[int(val*this.pow)]; } } }
package { import flash.display.*; import flash.sampler.*; import flash.events.*; import flash.utils.*; import flash.text.*; import flash.geom.*; public class FastSqrtTest extends Sprite { private var __logger:TextField = new TextField(); private function log(msg:*): void { __logger.appendText(msg + "\n"); } private var cache:Object; public function FastSqrtTest() { __logger.autoSize = TextFieldAutoSize.LEFT; addChild(__logger); log("Function,Time"); const CACHE_SIZE:int = 10000; this.cache = {}; for (i = 0; i < CACHE_SIZE; ++i) { this.cache[i * i] = i; } const DIGITS:int = 2; var beforeTime:int; var afterTime:int; var i:int; var j:int; var result:Number; const VAL:Number = 25; const REPS:int = 100000; const MAX:int = 100; const cache:Object = this.cache; var lut:LUT = new LUT(DIGITS, MAX, Math.sqrt); var lutTable:Vector.<Number> = lut.table; var lutPow:Number = lut.pow; beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = Math.sqrt(VAL); } } afterTime = getTimer(); log("Math.sqrt," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = babylonian(VAL); } } afterTime = getTimer(); log("Babylonian," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = bakhshali(VAL); } } afterTime = getTimer(); log("Bakhshali," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = VAL * 0.25; var a:Number; do { result = 0.5 * (result + VAL / result); a = result * result - VAL; if (a < 0) a = -a; } while (a > 0.0001) } } afterTime = getTimer(); log("Babylonian (inline)," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { var x:int = int(VAL); var y:int = cache[x]; if (y * y == VAL) { result = y; } else { var d:Number = VAL - x; var p:Number = d / (y << 2); var b:Number = y + p; result = b - p * p / 2 * b; } } } afterTime = getTimer(); log("Bakhshali (inline)," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = lut.val(VAL); } } afterTime = getTimer(); log("LUT," + (afterTime - beforeTime)); beforeTime = getTimer(); for(i = 0; i < REPS; ++i) { for (j = 0; j < MAX; ++j) { result = lutTable[int(VAL*lutPow)]; } } afterTime = getTimer(); log("LUT (inline)," + (afterTime - beforeTime)); } private function babylonian(n:Number): Number { var x:Number = n * 0.25; var a:Number; do { x = 0.5 * (x + n / x); a = x * x - n; if (a < 0) a = -a; } while (a > 0.0001) return x; } private function bakhshali(n:Number): Number { var x:int = int(n); var y:int = cache[x]; if (y * y == n) { return y; } var d:Number = n - x; var p:Number = d / (y << 2); var a:Number = y + p; return a - p * p / 2 * a; } } }
This is the environment I ran the test app on:
- Flex SDK (MXMLC) 4.1.0.16076, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 10.3.181.14
- 2.4 Ghz Intel Core i5
- Mac OS X 10.6.7
And here are the results I got:
Function | Time |
---|---|
Math.sqrt | 219 |
Babylonian | 415 |
Bakhshali | 424 |
Babylonian (inline) | 415 |
Bakhshali (inline) | 354 |
LUT | 163 |
LUT (inline) | 97 |
Here is the same performance test data in graph form:
From these results we can conclude that:
- The “Babylonian” and “Bakhshali” approaches are nearly twice as slow as
Math.sqrt
, even when inlined - The
LUT
class is about 25% faster when used normally and twice as fast when inlined
Speed is useless without accuracy, so let’s have a look at a demo app that plots the square roots from 0 to 100, scaled by a factor of 70 to make it easier to see:
package { import flash.display.*; import flash.filters.*; import flash.events.*; import flash.text.*; import flash.geom.*; [SWF(width=640,height=480,backgroundColor=0xEEEAD9)] public class FastSqrtDemo extends Sprite { private static const TEXT_FORMAT:TextFormat = new TextFormat("_sans", 11); private static const MAX:int = 10*10; private static const APPROX_METHODS:Array = [ ["LUT","lut"], ["Babylonian","babylonian"], ["Bakhshali","bakhshali"] ] private var bmd:BitmapData; private var bmdRect:Rectangle; private var approxFunc:Function; private var cache:Object; private var sqrtLUT:LUT; public function FastSqrtDemo() { stage.scaleMode = StageScaleMode.NO_SCALE; stage.align = StageAlign.TOP_LEFT; const CACHE_SIZE:int = 1+MAX; this.cache = {}; for (var i:int = 0; i < CACHE_SIZE; ++i) { this.cache[i * i] = i; } this.sqrtLUT = new LUT(2, CACHE_SIZE, Math.sqrt); var x:Number; var y:Number = 0; var prompt:TextField = new TextField(); prompt.y = y; prompt.defaultTextFormat = TEXT_FORMAT; prompt.htmlText = "<b>Approximation Method:</b>"; prompt.autoSize = TextFieldAutoSize.LEFT; prompt.selectable = false; addChild(prompt); const PAD:Number = 3; const demo:FastSqrtDemo = this; var buttons:Array = []; y += prompt.height + PAD; for each (var category:Array in APPROX_METHODS) { var tf:TextField = new TextField(); tf.mouseEnabled = false; tf.defaultTextFormat = TEXT_FORMAT; tf.text = category[0]; tf.autoSize = TextFieldAutoSize.LEFT; tf.selectable = false; tf.x = tf.y = PAD; var button:Sprite = new Sprite(); button.buttonMode = true; button.name = category[1]; button.graphics.beginFill(0xE6E2D1); button.graphics.drawRect(0, 0, tf.width+PAD*2, tf.height+PAD*2); button.graphics.endFill(); button.addChild(tf); button.addEventListener( MouseEvent.CLICK, function(ev:MouseEvent): void { var clickButton:Sprite = ev.currentTarget as Sprite; for each (var otherButton:Sprite in buttons) { otherButton.filters = [new BevelFilter()]; otherButton.buttonMode = true; otherButton.mouseEnabled = true; } clickButton.filters = []; clickButton.buttonMode = false; clickButton.mouseEnabled = false; approxFunc = demo[clickButton.name]; redraw(); } ); button.x = PAD; button.y = y; buttons.push(button); addChild(button); y += button.height + PAD; } var about:TextField = new TextField(); about.defaultTextFormat = TEXT_FORMAT; about.y = y; about.htmlText = "Black line: Math.sqrt\n" + "Red line: approximation\n" + "Demo by <font color=\"#0071BB\"><a href=\"" + "http://jacksondunstan.com/articles/1217" + "\">JacksonDunstan.com</a></font>"; about.autoSize = TextFieldAutoSize.LEFT; about.selectable = false; addChild(about); this.bmd = new BitmapData(stage.stageWidth, stage.stageHeight, false, 0xffffffff); this.bmdRect = new Rectangle(0, 0, this.bmd.width, this.bmd.height); addChildAt(new Bitmap(bmd), 0); buttons[0].dispatchEvent(new MouseEvent(MouseEvent.CLICK)); } private function redraw(): void { var bmd:BitmapData = this.bmd; bmd.lock(); var w:int = bmd.width; var h:int = bmd.height; bmd.fillRect(this.bmdRect, 0xEEEAD9); var halfH:int = h / 2; var stepVal:Number = MAX / Number(w); var val:Number = 0; const SCALE:Number = 70; for (var x:int = 0; x < w; ++x) { bmd.setPixel( x, h - 1 - Math.sqrt(val)*SCALE, 0xff000000 ); bmd.setPixel( x, h - 1 - (this.approxFunc(val) || 0)*SCALE, 0xffff0000 ); val += stepVal; } bmd.unlock(); } private function lut(x:Number): Number { return sqrtLUT.val(x); } private function babylonian(n:Number): Number { var x:Number = n * 0.25; var a:Number; do { x = 0.5 * (x + n / x); a = x * x - n; if (a < 0) a = -a; } while (a > 0.0001) return x; } private function bakhshali(n:Number): Number { var x:int = int(n); var y:int = cache[x]; if (y * y == n) { return y; } var d:Number = n - x; var p:Number = d / (y << 2); var a:Number = y + p; return a - p * p / 2 * a; } } }
This demo makes it plain that the LUT
and “Babylonian” methods are very accurate, but the “Bakhshali” method is useless (returning NaN
) everywhere but the perfect squares.
In conclusion, an inlined version of the LUT
class is twice as fast as Math.sqrt
. It’s only one line of code to inline and doesn’t use very much memory, so why not double your square root speed?
Know of any more ways to speed up square roots? Post a comment below!
#1 by makc on May 30th, 2011
iterative methods such as “babylonian” take more time with extra precision; did you make sure that precision is same as Math.sqrt in your test?
#2 by jackson on May 30th, 2011
You have a good point that the iterative approach can vary the level of accuracy. I didn’t test any levels of accuracy other than the level specified in the original boostworthy.com article. Still, I did mention at the beginning of the article that the “Babylonian” approach is “very accurate” and at the end of the article there is a demo app that graphs it compared to
Math.sqrt
with a very high level of accuracy. If you change thewhile
loop to0.1
, the accuracy degrades a little but the performance improves significantly. Here are the results I get on the same machine as in the article:#3 by Valentin on May 30th, 2011
Would be good to mention that most of the time you don’t need square root at all. For most checks you can use squared length i.e. x*x+y*y.
#4 by jackson on May 30th, 2011
That’s true. For example, if you’re sorting by distance you only need the distance squared. The article discusses the case where you really do need the actual square root.
#5 by Henke37 on May 30th, 2011
You forgot the fast squareroot trick. It’s not just for inverse square roots (and even if it was, it would be easy to invert it a second time).
http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Approximations_that_depend_on_IEEE_representation
#6 by jackson on May 30th, 2011
Any idea how you’d implement such a trick in AS3?
#7 by benjamin guihaire on October 25th, 2013
yes, using byteArray combined with alchemy for fast memory read & write. see this post : http://guihaire.com/code/?p=484
#8 by jonathan on May 30th, 2011
Have you heard about “Carmack’s Inverse Square Root” optimization ?
here : http://www.codemaestro.com/reviews/9
and here for Haxe execution : http://ncannasse.fr/blog/fast_inverse_square_root
impossible in as3 without Alchemy opcode.
#9 by jackson on May 30th, 2011
Perhaps some day I’ll implement this with Alchemy opcodes. Thanks for the tip.
#10 by bkogut on May 31st, 2011
nitpicking : actually, the real author is yet to be known
beyond3d.com – Origin of Quake3’s Fast InvSqrt()
http://www.beyond3d.com/content/articles/8/
#11 by pleclech on May 30th, 2011
There is also math functions implemented into Apparat inline or macro :
http://code.google.com/p/apparat/source/browse/apparat-ersatz/src/main/as3/apparat/math/FastMath.as
and
http://code.google.com/p/apparat/source/browse/apparat-ersatz/src/main/as3/apparat/math/MathMacro.as
Of course some of these are approximations so you ‘ll not get full precision.
Patrick.
#12 by skyboy on May 31st, 2011
Using these optimizations on the babylon function, I noticed a performance jump of 20-50% on my machine and a few others.
Due to my old hardware having a slow FPU, Math.sqrt is tremendously faster (75% faster than inline), however, on newer machines, I expect this gap to be narrower.
#13 by skyboy on May 31st, 2011
My bad, It I mis-pasted the code:
#14 by skyboy on May 31st, 2011
Wait, no. It’s the operators being swallowed by the anti-HTML thing.
#15 by jackson on May 31st, 2011
Interesting optimization. I think I got it cleaned up, but I might have errors in there. Let me know if I do. For example, why do you have
1e0
and not just1
? I tested it anyhow and it seems slow but very accurate:Perhaps this is due to the faster FPU in this Core i5.
#16 by skyboy on June 9th, 2011
I have 1e0 instead of 1 to be absolutely certain a double is compiled instead of int; I recall 1.0 occasionally being optimized to 1.
I also played with removing the loop entirely and running through 10-11 multiplications in a row, but on both that and the version above, I noticed lower numbers, or perfect squares taking longer due to them requiring a lower number of passes by smaller loops. The conditional-less version isn’t 100% accurate all the time, but it works as accurately as square root up to about 2 million (or 25 million. I can’t recall how many decimal places I had on the test number).
#17 by jpauclair on June 1st, 2011
I’de go with:
and play with 0.01 or 0.02 for better approximation
#18 by jpauclair on June 1st, 2011
Or directly in your test case: