Stage3D Readback Performance
At long last, Flash Player 11 has been released, and it carries with it a raft of exciting new features. Perhaps most exciting is the new Stage3D class (and related libraries), which enables GPU-accelerated graphics rendering. Today’s article is the first to cover this new API and discusses one of its features: reading back the rendered scene into a BitmapData that you can put on the regular Stage. Surely this will be a popular operation for merging 3D and 2D, so let’s see how fast it is!
If hardware acceleration is being used, the pixels will need to be sent back from video card memory (VRAM) into main system memory (RAM), which can be a very expensive operation. If the software renderer is being used instead of hardware acceleration, the pixels will already be in RAM so the transfer will be—theoretically—a much quicker memory copy operation.
To test this theory, I wrote a little performance app. It draws absolutely nothing with the Stage3D API and only displays a little UI for controlling the app. This way, we can isolate the performance of Context3D.drawToBitmapData, which is responsible for reading the Stage3D’s pixels into a BitmapData.
    package
    {
        import flash.display3D.*;
        import flash.external.*;
        import flash.display.*;
        import flash.sampler.*;
        import flash.system.*;
        import flash.events.*;
        import flash.utils.*;
        import flash.text.*;
        import flash.geom.*;

        import com.adobe.utils.*;

        [SWF(width=640,height=480,backgroundColor=0xEEEAD9)]
        public class Stage3DReadback extends Sprite
        {
            private static const PAD:Number = 3;
            private static const TEXT_FORMAT:TextFormat = new TextFormat("_sans", 11);

            private var __stage3D:Stage3D;
            private var __tf:TextField = new TextField();
            private var __context:Context3D;
            private var __bmdAlpha:BitmapData;
            private var __bmdNoAlpha:BitmapData;
            private var __mode:String;
            private var __enterFrameHandler:Function;
            private var __driverInfo:String;

            public function Stage3DReadback()
            {
                stage.align = StageAlign.TOP_LEFT;
                stage.scaleMode = StageScaleMode.NO_SCALE;

                __stage3D = stage.stage3Ds[0];

                makeButton("Toggle Hardware", onToggleHardware);
                makeButton("No Readback", onNoReadback);
                makeButton("Readback (no alpha)", onReadbackNoAlpha);
                makeButton("Readback (alpha)", onReadbackAlpha);

                var about:TextField = new TextField();
                about.autoSize = TextFieldAutoSize.LEFT;
                about.defaultTextFormat = TEXT_FORMAT;
                about.htmlText = '<font color="#0071BB">'
                    + '<a href="http://JacksonDunstan.com/articles/1446">'
                    + 'JacksonDunstan.com'
                    + '</a></font>\n'
                    + 'October 2011';
                about.x = stage.stageWidth - PAD - about.width;
                about.y = PAD;
                addChild(about);

                var logger:TextField = __tf;
                logger.autoSize = TextFieldAutoSize.LEFT;
                logger.y = this.height;
                addChild(logger);

                __mode = "No Readback";
                __enterFrameHandler = onEnterFrameNoReadback;
                setupContext(Context3DRenderMode.AUTO);
            }

            private function setupContext(renderMode:String): void
            {
                __tf.text = "Setting up context with render mode: " + renderMode;
                __stage3D.addEventListener(Event.CONTEXT3D_CREATE, onContextCreated);
                __stage3D.requestContext3D(renderMode);
            }

            private function onContextCreated(ev:Event): void
            {
                __stage3D.removeEventListener(Event.CONTEXT3D_CREATE, onContextCreated);

                const width:int = stage.stageWidth;
                const height:int = stage.stageHeight;

                __context = __stage3D.context3D;
                __context.configureBackBuffer(width, height, 0, true);
                __driverInfo = __context.driverInfo;

                // First time only
                if (!__bmdNoAlpha)
                {
                    __bmdNoAlpha = new BitmapData(width, height, false);
                    __bmdAlpha = new BitmapData(width, height, true);
                }

                setMode(__mode, __enterFrameHandler);
            }

            private function removeAllEnterFrameHandlers(): void
            {
                removeEventListener(Event.ENTER_FRAME, onEnterFrameNoReadback);
                removeEventListener(Event.ENTER_FRAME, onEnterFrameReadbackNoAlpha);
                removeEventListener(Event.ENTER_FRAME, onEnterFrameReadbackAlpha);
            }

            private function setMode(name:String, enterFrameHandler:Function): void
            {
                removeAllEnterFrameHandlers();
                __mode = name;
                __enterFrameHandler = enterFrameHandler;
                addEventListener(Event.ENTER_FRAME, enterFrameHandler);
            }

            private function onToggleHardware(ev:MouseEvent): void
            {
                removeAllEnterFrameHandlers();
                __context.dispose();
                __tf.text = "Toggling hardware...";
                setupContext(
                    __driverInfo.toLowerCase().indexOf("software") >= 0
                        ? Context3DRenderMode.AUTO
                        : Context3DRenderMode.SOFTWARE
                );
            }

            private function onNoReadback(ev:MouseEvent): void
            {
                setMode("No Readback", onEnterFrameNoReadback);
            }

            private function onReadbackNoAlpha(ev:MouseEvent): void
            {
                setMode("Readback (no alpha)", onEnterFrameReadbackNoAlpha);
            }

            private function onReadbackAlpha(ev:MouseEvent): void
            {
                setMode("Readback (alpha)", onEnterFrameReadbackAlpha);
            }

            private function reportTime(name:String, time:int): void
            {
                __tf.text = __driverInfo + " - " + name + ": " + time + " ms";
            }

            private function onEnterFrameNoReadback(ev:Event): void
            {
                var beginTime:int = getTimer();
                __context.clear(0xEE/255, 0xEA/255, 0xD9/255, 1.0);
                __context.present();
                var endTime:int = getTimer();
                var drawTime:int = endTime - beginTime;
                reportTime("No readback", drawTime);
            }

            private function onEnterFrameReadbackNoAlpha(ev:Event): void
            {
                var beginTime:int = getTimer();
                __context.clear(0xEE/255, 0xEA/255, 0xD9/255, 1.0);
                __context.drawToBitmapData(__bmdNoAlpha);
                __context.present();
                var endTime:int = getTimer();
                var drawTime:int = endTime - beginTime;
                reportTime("Readback (no alpha)", drawTime);
            }

            private function onEnterFrameReadbackAlpha(ev:Event): void
            {
                var beginTime:int = getTimer();
                __context.clear(0xEE/255, 0xEA/255, 0xD9/255, 1.0);
                __context.drawToBitmapData(__bmdAlpha);
                __context.present();
                var endTime:int = getTimer();
                var drawTime:int = endTime - beginTime;
                reportTime("Readback (alpha)", drawTime);
            }

            private function makeButton(label:String, callback:Function): void
            {
                var tf:TextField = new TextField();
                tf.defaultTextFormat = TEXT_FORMAT;
                tf.name = "label";
                tf.text = label;
                tf.autoSize = TextFieldAutoSize.LEFT;
                tf.selectable = false;
                tf.x = tf.y = PAD;

                var button:Sprite = new Sprite();
                button.name = label;
                button.graphics.beginFill(0xE6E2D1);
                button.graphics.drawRect(0, 0, tf.width+PAD*2, tf.height+PAD*2);
                button.graphics.endFill();
                button.graphics.lineStyle(1, 0x000000);
                button.graphics.drawRect(0, 0, tf.width+PAD*2, tf.height+PAD*2);
                button.addChild(tf);
                button.addEventListener(MouseEvent.CLICK, callback);

                button.x = PAD + this.width;
                button.y = PAD;
                addChild(button);
            }
        }
    }
- Launch Performance Test (VGA resolution = 640×480)
- Launch Performance Test (SVGA resolution = 800×600)
- Launch Performance Test (XGA resolution = 1024×768)
- Launch Performance Test (720p resolution = 1280×720)
- Launch Performance Test (1080p resolution = 1920×1080)
I ran this performance test with the following environment:
- Flex SDK (MXMLC) 4.5.1.21328, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 11.0.1.152
- 2.4 GHz Intel Core i5
- Mac OS X 10.7.1
- NVIDIA GeForce GT 330M 256 MB
And got these results:
Hardware (times in milliseconds)
Resolution | No Readback | Readback (no alpha) | Readback (alpha) |
---|---|---|---|
640×480 | 0 | 3 | 3 |
800×600 | 0 | 4 | 4 |
1024×768 | 0 | 6 | 6 |
1280×720 | 0 | 8 | 8 |
1920×1080 | 0 | 15 | 15 |
Software (times in milliseconds)
Resolution | No Readback | Readback (no alpha) | Readback (alpha) |
---|---|---|---|
640×480 | 1 | 2 | 2 |
800×600 | 1 | 4 | 4 |
1024×768 | 3 | 6 | 6 |
1280×720 | 3 | 7 | 7 |
1920×1080 | 7 | 15 | 15 |
Software rendering is clearly slower overall, even with a blank scene. Unfortunately, it seems no faster at reading the scene back into the BitmapData than the hardware-accelerated version. This would have been one of software rendering’s only performance advantages over hardware-accelerated rendering, but it seems as though this optimization is not (yet) in place.
Nonetheless, this test points out an important fact: reading the scene’s pixels back into a BitmapData is very expensive and possibly not feasible in real time with large scenes. For example, a game attempting to run at a smooth 30 frames-per-second has only 33 milliseconds per frame to do its work. If reading the 3D scene back into RAM takes 15 milliseconds, the rest of the game (e.g. physics, sound, 2D rendering, networking) must be quite fast to accommodate it. It’s also a good idea to think about systems older than my test machine, which is a relatively new MacBook Pro. Still, if adding 3D content to a 2D stage scene is very important, it seems like it can be accomplished so long as you limit the resolution of the 3D scene.
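To make the budget arithmetic concrete, here is a minimal sketch. FrameBudget and its methods are hypothetical names that are not part of the test app; the inputs used afterwards are the 15 ms readback time measured at 1920×1080 and the 30 frames-per-second target mentioned above.

```actionscript
package
{
    /**
     * Hypothetical helper (not part of the test app) that expresses a
     * readback time as a share of the per-frame time budget.
     */
    public class FrameBudget
    {
        /** Milliseconds available per frame at the given frame rate. */
        public static function frameBudgetMs(fps:Number): Number
        {
            return 1000 / fps;
        }

        /** Fraction of the frame budget consumed by a readback. */
        public static function readbackShare(readbackMs:Number, fps:Number): Number
        {
            return readbackMs / frameBudgetMs(fps);
        }
    }
}
```

Under these assumptions, FrameBudget.readbackShare(15, 30) comes out to about 0.45, so nearly half of each frame would be spent on the readback alone. At 60 frames-per-second the same 15 ms readback would consume roughly 90% of the frame.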
Spot a bug? Have a suggestion? Different results on a different OS or video card? Post a comment!
#1 by ben w on October 10th, 2011 ·
as you have done half the hard work already do you think you could do me a favour?
and tell me how fast read back is for a 1×1 pixel bitmapData.
my reason for this is that I used a texture readback to handle mouse interactions with a complex scene, by encoding object information into a colour texture and as you have discovered this can only really be used for debugging due to the fact that it is quite costly :(
BUT, with regards to the mouse in theory I only need to render 1 pixel of the whole screen (the pixel under the mouse) so I can use a teeny tiny frustum to cull away the vast majority of a scene and then render all objects that are in/intersecting that small frustum..
so can you get away with a 1 pixel read back? :D if it comes in sub 1ms then I think it has a use.
ben
#2 by ben w on October 10th, 2011 ·
(might have double posted)
oh and if it is fast (sub 1ms) how many can be done before one hits the 1ms mark
thanks
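For readers unfamiliar with the colour-encoding idea described in this thread, the following is a rough sketch of how an object index might be packed into a colour and recovered from a pixel that was read back. ObjectIdColor and its methods are hypothetical names, and the sketch assumes the renderer writes the colour exactly (no lighting, blending, or anti-aliasing), which is an assumption rather than something tested here.

```actionscript
package
{
    /** Hypothetical helper for colour-encoded object picking. */
    public class ObjectIdColor
    {
        /**
         * Converts an object index (0..0xFFFFFF) into RGBA components in
         * the 0..1 range, suitable for use as a shader constant.
         */
        public static function toConstants(id:uint): Vector.<Number>
        {
            var r:uint = (id >> 16) & 0xFF;
            var g:uint = (id >> 8) & 0xFF;
            var b:uint = id & 0xFF;
            return new <Number>[r / 255, g / 255, b / 255, 1];
        }

        /**
         * Recovers the object index from a 32-bit ARGB pixel, such as a
         * value returned by BitmapData.getPixel32, by dropping the alpha.
         */
        public static function fromPixel(argb:uint): uint
        {
            return argb & 0xFFFFFF;
        }
    }
}
```

The constants could then be uploaded per object with Context3D.setProgramConstantsFromVector before drawing it into the object buffer. This only addresses the encoding; it does nothing about the cost of the readback itself.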
#3 by jackson on October 10th, 2011 ·
This sounds kind of like using an object buffer: writing color-encoded pixels to a screen-size texture, reading that back, and querying the values (e.g. via BitmapData.getPixel32). The problem there is that you double your fill rate (the number of pixels drawn per frame), which is extremely expensive. The bonus is that you get per-pixel accuracy with the mouse. Most programmers choose to cast a ray from the camera “into” the scene and intersect with bounding boxes around the objects in the scene, or possibly even the object’s triangle mesh if the bounding box test passes. This is much faster but not necessarily pixel-perfect.
As for your strategy, I’m not sure exactly what you’ve done so I’m having a hard time recreating it. You can’t have a 1×1 back buffer or read from a 1×1 texture (or any texture for that matter), so I’m not sure how you’re reading just one pixel. If you’re reading the whole screen back as in the “object buffer” approach above, the performance should be just as awful as in the article.
In any case, unless you really need per-pixel accuracy I would recommend going the “ray casting” approach with good-fitting bounding boxes. There are plenty of tutorials online covering this topic, which is called “picking”.
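As a rough illustration of the bounding-box test behind “picking”, here is a sketch of a ray versus axis-aligned box intersection using the common slab method. RayPicker and intersectAABB are hypothetical names, and building the ray from the mouse position and camera matrices is left out; this shows only the intersection step and is not a complete picking system.

```actionscript
package
{
    import flash.geom.Vector3D;

    /** Hypothetical picking helper; names are illustrative only. */
    public class RayPicker
    {
        /**
         * Slab test against an axis-aligned bounding box. Returns the ray
         * parameter t of the nearest hit (a distance if dir is normalized,
         * 0 if the origin is inside the box) or -1 if the ray misses.
         */
        public static function intersectAABB(
            origin:Vector3D, dir:Vector3D,
            boxMin:Vector3D, boxMax:Vector3D): Number
        {
            var o:Array = [origin.x, origin.y, origin.z];
            var d:Array = [dir.x, dir.y, dir.z];
            var mn:Array = [boxMin.x, boxMin.y, boxMin.z];
            var mx:Array = [boxMax.x, boxMax.y, boxMax.z];

            var tNear:Number = 0;
            var tFar:Number = Number.MAX_VALUE;

            for (var i:int = 0; i < 3; ++i)
            {
                if (d[i] == 0)
                {
                    // Ray is parallel to this slab: miss if outside it
                    if (o[i] < mn[i] || o[i] > mx[i])
                    {
                        return -1;
                    }
                }
                else
                {
                    var t1:Number = (mn[i] - o[i]) / d[i];
                    var t2:Number = (mx[i] - o[i]) / d[i];
                    if (t1 > t2)
                    {
                        var tmp:Number = t1;
                        t1 = t2;
                        t2 = tmp;
                    }
                    if (t1 > tNear) tNear = t1;
                    if (t2 < tFar) tFar = t2;
                    if (tNear > tFar)
                    {
                        return -1;
                    }
                }
            }
            return tNear;
        }
    }
}
```

To pick an object, one would run this test against every candidate box and keep the hit with the smallest non-negative result. No pixels need to be read back from the GPU, which is why this approach avoids the cost measured in the article.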
Have fun,
-Jackson
#4 by Alex on October 10th, 2011 ·
My results:
640×480: DirectX9 readback (no alpha) – 125 ms, Software (direct blitting) – 10 ms.
IE 8, Win7, Flash Player 11.0.1.152. What could it be? Do I need IE9 to get acceleration?
#5 by jackson on October 10th, 2011 ·
You don’t need IE9 to get hardware acceleration; I’ve used Chrome and Firefox successfully. When you’re in DirectX mode, does it display “DirectX9 (Direct blitting)” next to the draw time?
(my results on Windows 7 are similar to the Mac results in the article)
#6 by Alex on October 11th, 2011 ·
It displays this:
DirectX9 (Direct) – Readback (no alpha): 117 ms
#7 by skyboy on October 10th, 2011 ·
FF3.6, WinXP, FP 11.0.1.152, Intel 82845(G/GL/GE/PE/GV; not sure which) chipset, Intel Celeron 2.7 GHz; native/active resolution of 1680×1050
#8 by orion on October 13th, 2011 ·
you’ve read Ben Garney’s recommendations on best practices for 3D in flash ?
http://blog.bengarney.com/2010/11/01/tips-for-flash-developers-looking-at-hardware-3d/
notably: “Never ever read back from a GPU resource”.
as a long-time openGL person, i thoroughly agree: if you want performance, don’t readback.
if you don’t care about graphics performance (and that’s understandable for folks used to flash)
then it’s fine, but if you want 60FPS with significant complexity, you have to take the realities of a GPU into account.
#9 by Waetherman on January 18th, 2017 ·
Why do you misquote? You omitted an important part: “Ok – you can maybe do this ONCE per frame, if you are careful and build your renderer around it.”
I also have to disagree with the tip about capping the framerate. At the time you wrote this post, there probably wasn’t a framerate police, but there is one now. Also, I heard Total Biscuit didn’t play Hyper Light Drifter because it was capped at 30 FPS. You absolutely want to optimize your game for at least 60 FPS (which doesn’t give you a huge range in Flash, which is capped at 60 FPS).
#10 by AdamCreative on March 12th, 2013 ·
This is very useful information. It seems like the size of the bitmap read affects the bottleneck as you have shown. The question has been asked – what about just reading a small region e.g. 1×1 under the mouse.
Actually I think that could be possible because the Context3D.drawToBitmapData is clipped to the “size of the destination bitmap” http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/display3D/Context3D.html#drawToBitmapData()
That suggests that you could shift the viewing coordinates so that the mouse position (x,y) is shifted to 0,0 and then pass in a small (1×1) destination BitmapData into drawToBitmapData to get the backbuffer color value under the mouse. Assuming the data transfer is linear in the size of the destination then this should be a lot quicker.
Anyone tried anything like that?
#11 by jackson on March 12th, 2013 ·
It’s worth trying out, but I’m guessing it’ll still be quite slow. Since the documentation says it’s “clipped”, that could mean that the whole back buffer is read and then simply discarded except for the pixel you care about. The only way to find out for sure is to set up a real performance test, so perhaps there will be a follow-up article.
#12 by adns on October 26th, 2014 ·
Do you know why it is so slow? I got 30ms on 1920×1080. I can ping a server across the continent faster with wi-fi connection… :/
#13 by jackson on October 26th, 2014 ·
All (multi-threaded) rendering must stop and a huge number of pixels (1920*1080*3 bytes ~= 6 MB) must be transferred from the frame buffer in video memory to a location in system memory where a BitmapData is allocated. The exact times will depend on your system (graphics system and driver, memory bus, memory architecture, etc.), but 30ms doesn’t sound unreasonable given that I got 15ms on my test machine. I could certainly see many Android devices with dedicated VRAM taking that long. In short, only do this when absolutely necessary. For example, it’s necessary to save screenshots but you shouldn’t try to record or stream a video with it.
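The size estimate above can be worked through with a small sketch. ReadbackEstimate is a hypothetical helper; the 3 bytes per pixel figure is taken from the comment, and the real per-pixel size of the transfer may differ, so it is left as a parameter.

```actionscript
package
{
    /** Hypothetical helper for estimating readback size and throughput. */
    public class ReadbackEstimate
    {
        /** Bytes moved for one full-frame readback. */
        public static function bytesPerFrame(
            width:int, height:int, bytesPerPixel:int): Number
        {
            return width * height * bytesPerPixel;
        }

        /** Implied throughput in megabytes (10^6 bytes) per second. */
        public static function megabytesPerSecond(bytes:Number, ms:Number): Number
        {
            return (bytes / 1e6) / (ms / 1000);
        }
    }
}
```

With these assumptions, bytesPerFrame(1920, 1080, 3) is 6,220,800 bytes (about 6.2 MB). Moving that much in the 15 ms measured in the article implies roughly 415 MB per second, and the 30 ms reported above implies about half of that.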