{"id":2215,"date":"2020-08-23T23:09:22","date_gmt":"2020-08-23T15:09:22","guid":{"rendered":"http:\/\/blog.coolcoding.cn\/?p=2215"},"modified":"2020-08-23T23:09:22","modified_gmt":"2020-08-23T15:09:22","slug":"a-look-at-the-powervr-graphics-architecture-tile-based-rendering","status":"publish","type":"post","link":"https:\/\/blog.coolcoding.cn\/?p=2215","title":{"rendered":"A look at the PowerVR graphics architecture: Tile-based rendering"},"content":{"rendered":"\n<p><a href=\"https:\/\/www.imgtec.com\/blog\/a-look-at-the-powervr-graphics-architecture-tile-based-rendering\/\">https:\/\/www.imgtec.com\/blog\/a-look-at-the-powervr-graphics-architecture-tile-based-rendering\/<\/a><\/p>\n\n\n\n<p>I\u2019m fond of telling the story about why I joined Imagination. It goes along the lines of: despite offers to go work on graphics in much sunnier climes, I took the job working on PowerVR Graphics here in distinctly un-sunny Britain because I was really interested in how\u00a0<em><strong><a rel=\"noreferrer noopener\" href=\"https:\/\/www.imgtec.com\/powervr\/powervr-architecture.asp\" target=\"_blank\">Tile-Based Deferred Rendering (TBDR)<\/a><\/strong><\/em>\u00a0could work in practice. My graphics career to-date had been mostly focused on the conceptually simpler\u00a0<em><strong>Immediate Mode Renderers (IMRs)<\/strong><\/em>\u00a0of the day \u2013 mostly GeForces and Radeons.<\/p>\n\n\n\n<p>And no offence to the folks who designed said GeForces and Radeons \u2013 a few of whom I am friends with and many more I know quite well, but the front-end architecture of a modern discrete IMR GPU isn\u2019t the most exciting thing in the world. 
Designed around having plenty of dedicated bandwidth, those GPUs go about the job of painting pixels in a reasonably inefficient way, but one that\u2019s conceptually simple and manifests itself in silicon in a similarly simple way, which makes it relatively easy for the GPU architect to design, spec and have built by the hardware team.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-XMQm6gWE\/w:auto\/h:auto\/q:90\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2013\/05\/IMR-Pipeline-1.jpg\" target=\"_blank\" rel=\"noreferrer noopener\"><img decoding=\"async\" src=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-XMQm6gWE\/w:700\/h:171\/q:90\/dpr:1.1\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2013\/05\/IMR-Pipeline-1.jpg\" alt=\"IMR Pipeline\" class=\"wp-image-3749\"\/><\/a><\/figure><\/div>\n\n\n\n<p>Immediate Mode Rendering at work<\/p>\n\n\n\n<p>With an IMR you send work to the GPU and it gets drawn straight away. There\u2019s little connection to what else has already been drawn, or will be drawn in the future. You send triangles, you shade them. You rasterise them into pixels, you shade those. You send the rendered pixels to the screen. Triangles in, pixels out, job done! But, crucially the job is done with no context of what\u2019s already happened, or what might happen in the future.<\/p>\n\n\n\n<p>PowerVR GPUs are about as different as they come in that respect, and it\u2019s that which made me take the job here, to figure out how PowerVR\u2019s architects, hardware teams and software folks had made TBDR actually work in real products. My instinct was that TBDRs would be too complex to build so that they\u2019d work well and actually provide a benefit. 
I had a chance to figure it out and five years later I\u2019m still here, helping figure out how we\u2019ll evolve it in the future, along with the rest of the GPU\u2019s microarchitecture.<\/p>\n\n\n\n<p>As far as the graphics programmer is concerned, PowerVR still looks like&nbsp;<em>triangles in, pixels out, job done<\/em>. But under the hood something much more exciting is happening. And while the exciting part put me in this chair so I could write about it 5 years later, crucially it\u2019s also that other good E word: efficient!<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">It always starts with the classic TBDR vs. IMR debate<\/h4>\n\n\n\n<p>To help understand why, let\u2019s keep talking about IMRs. One of the biggest things stopping a desktop-class IMR scaling down to fit the power, performance and area budgets of modern embedded application processors is bandwidth. It\u2019s such a scarce resource, even in high-end processors \u2013 mostly because of power, area, wiring and packaging limitations, among other things \u2013 that you&nbsp;<strong><em>really<\/em><\/strong>&nbsp;need to use it as efficiently as possible.<\/p>\n\n\n\n<p>IMRs don\u2019t do that very well, especially when pixel shading. Remember that there are usually a great many more pixels being rendered than the vertices used to build triangles. On top of that, with an IMR pixels are often still shaded despite never being visible on the screen, and that costs large amounts of precious bandwidth and power. Here\u2019s why.<\/p>\n\n\n\n<p>Textures for those pixels need to be sampled, and those pixels need to be written out to memory \u2013 and often read back in and written out again! \u2013 before being drawn on the screen. 
While all modern IMRs have means in hardware to try and avoid some of that redundant work, say one building in the background being completely obscured by one drawn closer to you, there are things the application developer can do to effectively disable those mechanisms, such as always drawing the building in the background first.<\/p>\n\n\n\n<p>In our architecture it doesn\u2019t really matter how the application developer draws what\u2019s on the screen. There are exceptions for non-opaque geometry, which the developer still needs to manage, but otherwise we\u2019re submission order independent. That capability is something we\u2019ve had in our hardware since before we were ever an IP company and still made&nbsp;<a href=\"https:\/\/www.imgtec.com\/news\/detail.asp?ID=58\" target=\"_blank\" rel=\"noreferrer noopener\">our own standalone PC and console GPUs<\/a>. You could draw the building in the background first, then the one in the foreground on top, and we\u2019ll never perform pixel shading for the first one, unlike an IMR.<\/p>\n\n\n\n<p>We effectively sort all of the opaque geometry in the GPU, regardless of how and when it was submitted by the application, to figure out the top-most triangles. 
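<p>As a toy illustration of that submission-order independence (a conceptual sketch in Python of my own making, not the hardware\u2019s actual mechanism), imagine resolving opaque fragments per pixel first, and only shading whichever fragment survives:<\/p>

```python
# Sketch: for opaque geometry, only the nearest fragment per pixel is
# ever shaded, no matter what order the fragments were submitted in.

def resolve_opaque(fragments):
    """fragments: list of (pixel, depth, shade_fn); smaller depth = nearer.
    Keeps only the front-most fragment per pixel, then shades each
    survivor exactly once."""
    nearest = {}
    for pixel, depth, shade_fn in fragments:
        if pixel not in nearest or depth < nearest[pixel][0]:
            nearest[pixel] = (depth, shade_fn)
    # Shading is deferred until visibility is fully resolved:
    return {pixel: shade(depth) for pixel, (depth, shade) in nearest.items()}

# Background building drawn first, foreground drawn second: the
# background fragment at pixel (0, 0) is never shaded.
frags = [((0, 0), 10.0, lambda d: "background"),
         ((0, 0), 2.0,  lambda d: "foreground")]
print(resolve_opaque(frags))  # {(0, 0): 'foreground'}
```

<p>Reversing the submission order gives the same result, which is the point: shading cost depends on what is visible, not on how the application ordered its draws.<\/p>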
Sure, if a developer perfectly sorts their geometry then an IMR can get much closer to our efficiency, but that\u2019s not the common case by any means.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-iyMYZ4WC\/w:auto\/h:auto\/q:90\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2013\/05\/TBDR-Pipeline-1.jpg\" target=\"_blank\" rel=\"noreferrer noopener\"><img decoding=\"async\" src=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-iyMYZ4WC\/w:700\/h:188\/q:90\/dpr:1.1\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2013\/05\/TBDR-Pipeline-1.jpg\" alt=\"TBDR Pipeline\" class=\"wp-image-3750\"\/><\/a><\/figure><\/div>\n\n\n\n<p>PowerVR TBDRs<\/p>\n\n\n\n<p>Think again about all the work that\u2019s saving, especially for modern content: for&nbsp;<strong><em>every pixel<\/em><\/strong>&nbsp;shaded there are going to be a non-trivial number of texture lookups for various things, dozens and sometimes hundreds of ALU cycles spent to run computation on that texture data in order to apply the right effects, which often means writing the pixel out to an intermediate surface to be read back in again in a further rendering pass, and then the pixel needs to be stored in memory at the end of shading, so it can be displayed on screen.<\/p>\n\n\n\n<p>And that\u2019s just one optimisation that we have. So even though we\u2019ve avoided processing completely occluded geometry, there\u2019s still bandwidth-saving work we can do at the pixel processing stage. 
Because we split the screen up into tiles, where we figure out all of the geometry that contributes to the tile so we only process what we need to, and we know exactly how big the tile is (currently 32\u00d732 pixels, but it\u2019s been smaller and even non-square in prior designs), we can build enough on-chip storage to process a few of those tiles at a time, without having to use storage in external memory again until we\u2019ve finished and want to write the final pixels out.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-vsZuZTWw\/w:auto\/h:auto\/q:90\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2013\/10\/TBDR-architecture-1.png\"><img decoding=\"async\" src=\"http:\/\/blog.coolcoding.cn\/wp-content\/uploads\/2020\/08\/TBDR-architecture-1.png\" alt=\"TBDR architecture\" class=\"wp-image-5096\"\/><\/a><\/figure><\/div>\n\n\n\n<p>PowerVR GPUs split the screen into tiles<\/p>\n\n\n\n<p>There are secondary benefits to working on the screen one region at a time; benefits that other GPUs take advantage of too: because it\u2019s highly likely that a pixel on the screen will share some data with its immediate neighbours, it\u2019s likely that when we move on to processing the neighbouring pixels we\u2019ve already fetched the data into cache and don\u2019t have to wait for another set of external memory accesses, saving bandwidth again. It\u2019s a classic exploitation of spatial locality that\u2019s present in a lot of modern 3D rendering.<\/p>\n\n\n\n<p>So that\u2019s the top-level view of the biggest benefits of a TBDR versus an IMR in terms of processing and (especially) bandwidth efficiency. But how does it actually work in hardware? If you\u2019re not too hardware-inclined you can stop here! 
If you go no further, you\u2019ll still have understood the big top-level benefits of how we go about making best use of the available and very precious bandwidth throughout rendering in embedded, low-power systems.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">How TBDR works in hardware<\/h4>\n\n\n\n<p>For those interested in how things happen in the hardware, let\u2019s talk about the tiler in the context of&nbsp;<a href=\"https:\/\/www.imgtec.com\/blog\/powervr-rogue-designing-an-optimal-architecture-for-graphics-and-gpu-compute\" target=\"_blank\" rel=\"noreferrer noopener\">a modern Rogue GPU<\/a>.<\/p>\n\n\n\n<p>A 3D graphics application starts by telling us where its geometry is in memory, so we ask the GPU to fetch it and perform vertex shading. We have blocks in our GPUs that are responsible for the non-programmable steps of each kind of task type, called&nbsp;<em><strong>the data masters<\/strong><\/em>. They do a bunch of different things on behalf of&nbsp;<em><strong>the Universal Shading Cluster or USC<\/strong><\/em>&nbsp;(our shading core) to do the fixed-function bits of any workload, including fetching data from memory. 
So because we\u2019re vertex shading, it\u2019s&nbsp;<em><strong>the vertex data master (VDM)<\/strong><\/em>&nbsp;that gets involved at this point, to fetch the vertex data from memory based on information provided by the driver.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-csu2ht7t\/w:auto\/h:auto\/q:90\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2014\/10\/PowerVR-Series7-Series7XT-architecture.png\"><img decoding=\"async\" src=\"http:\/\/blog.coolcoding.cn\/wp-content\/uploads\/2020\/08\/PowerVR-Series7-Series7XT-architecture.png\" alt=\"PowerVR Series7 - Series7XT architecture\" class=\"wp-image-6881\"\/><\/a><\/figure><\/div>\n\n\n\n<p>PowerVR Series7XT is the latest family of Rogue GPUs<\/p>\n\n\n\n<p>The data could be stored in memory as lines, triangles or points. It could be indexed or non-indexed. There are associated shader programs and accompanying data for those programs. The VDM fetches whatever\u2019s needed, using another couple of internally-programmable blocks to help, and emits it all to the USC for vertex shading. The USC runs the shader program and the output vertices are stored on-chip.<\/p>\n\n\n\n<p>They\u2019re then consumed by hardware that performs primitive assembly, certain kinds of culling, and then clipping. If the geometry is back-facing or can be determined to be completely off the screen, it\u2019s culled. All of the remaining on-screen front-facing geometry is sent to be clipped. There\u2019s a fast path here for geometry that doesn\u2019t intersect a clip plane, to let it make onwards progress with no extra processing bar the intersection test. If the geometry intersects with a plane, the clipper generates new geometry so that passed-on vertices are fully on-screen (even though they might be right at the edges). 
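<p>To make the culling step concrete, here\u2019s a minimal sketch (my own illustration in Python, not the hardware\u2019s implementation) of the classic signed-area winding test on screen-space vertices:<\/p>

```python
def signed_area_2x(a, b, c):
    # Twice the signed area of screen-space triangle abc; the sign
    # encodes winding order (positive = counter-clockwise here).
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def is_back_facing(a, b, c, front_is_ccw=True):
    # A triangle wound the "wrong" way relative to the chosen front-face
    # convention is back-facing and can be culled before clipping.
    area2 = signed_area_2x(a, b, c)
    return area2 < 0 if front_is_ccw else area2 > 0

print(is_back_facing((0, 0), (4, 0), (0, 4)))  # False: CCW, front-facing
print(is_back_facing((0, 0), (0, 4), (4, 0)))  # True: CW, culled
```

<p>Zero-area (degenerate) triangles produce no coverage and real pipelines typically discard them too; the sketch above leaves them alone for simplicity.<\/p>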
The clipper can do some other cool things, but they\u2019re not too relevant for a big picture explanation like this.<\/p>\n\n\n\n<p>Then we\u2019re on to where a lot of the magic happens, compared to IMRs. We\u2019re obviously aiming to run computation in multiple phases in the hardware, to maximise efficiency and occupancy: one front-end phase to figure out what\u2019s going on with shaded geometry and bin it into tiles, then one phase to consume that binned data, rasterise it and pass it on for pixel shading and final write-out. To keep things as efficient as possible that intermediate acceleration structure between the two main phases has to be as optimal as we can make it.<\/p>\n\n\n\n<p>Clearly it\u2019s a bandwidth cost to create it and read it back, one which our competitors like to pick on when it comes to a competitive advantage they have over us. And it\u2019s true; an IMR doesn\u2019t have to deal with it. But given the bandwidth savings we have in our processing model, creating that acceleration structure \u2013 which we call&nbsp;<em><strong>the Parameter Buffer (PB)<\/strong><\/em>&nbsp;\u2013 before feeding it to our trick rasteriser, we still end up with a huge bandwidth advantage in typical rendering situations, especially complex game-like scenes.<\/p>\n\n\n\n<p>So how do we generate the PB? The clipper outputs a stream of primitives and render target IDs into memory, grouped by render target. Think of it as a container around collections of related geometry. The relationship is critical: we don\u2019t want to read it later and not consume a majority of the data that\u2019s inside, since that\u2019d be wasteful. The output at this stage is the main data structure stored in the PB. 
We then compress that data on its way to memory, and in the general case it compresses very well, so we save quite a lot of PB creation bandwidth just from that step alone.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The concept of tiling<\/h4>\n\n\n\n<p>Now for the bit that most people sort of understand about our hardware architecture: tiling. The tiling engine has one main job: output some data that marks out a tiled region, some associated state, and a set of pointers to the geometry that contributes to that region. We also store masks just in case a primitive doesn\u2019t actually contribute to the region, but is stored in memory anyway. That lets us save some bandwidth and processing for that geometry, because it doesn\u2019t contribute to the tile.<\/p>\n\n\n\n<p>We call the resulting data structure a primitive list. If you\u2019ve ever consumed any of&nbsp;<a href=\"https:\/\/www.imgtec.com\/developers\/documentation\/\" target=\"_blank\" rel=\"noreferrer noopener\">our developer documentation<\/a>, you\u2019ll have seen mention of primitive lists as the intermediate data structure between the front-end phase and the pixel processing phase. Next, some more magic that\u2019s specific to the PowerVR way of doing things.<\/p>\n\n\n\n<p>Imagine you were tasked with building this bit of the architecture yourself, where you had to determine what regions need to be rasterised for a given set of geometries. There\u2019s one obvious algorithm you could choose: the bounding box. Draw a box around the triangle that covers its extents, and whatever tiles that box touches are the ones you rasterise for that triangle. That falls down pretty quickly though, efficiency-wise.<\/p>\n\n\n\n<p>Imagine a fairly long and thin triangle drawn across the screen in any orientation. You can quickly picture that the bounding box for that triangle is going to lie over tiles that the triangle doesn\u2019t actually touch. 
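<p>To put rough numbers on that overcoverage, here\u2019s a small self-contained sketch (illustrative Python of my own, with exact coverage approximated by point sampling rather than the hardware\u2019s real triangle\/tile test):<\/p>

```python
# Compare bounding-box binning against (approximate) exact coverage
# for a long, thin triangle drawn across a 256x256 screen of 32x32 tiles.
TILE = 32

def _edge(p, a, b):
    # Edge function: sign tells which side of segment ab the point p is on.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def _inside(p, tri):
    a, b, c = tri
    d = (_edge(p, a, b), _edge(p, b, c), _edge(p, c, a))
    # Inside (either winding) when the edge signs don't disagree.
    return not (any(x < 0 for x in d) and any(x > 0 for x in d))

def bbox_tiles(tri):
    # Every tile the triangle's bounding box overlaps.
    xs, ys = [p[0] for p in tri], [p[1] for p in tri]
    return {(tx, ty)
            for tx in range(min(xs) // TILE, max(xs) // TILE + 1)
            for ty in range(min(ys) // TILE, max(ys) // TILE + 1)}

def sampled_tiles(tri, step=4):
    # Crude stand-in for a real triangle/tile overlap test: a tile counts
    # if any sample point on a coarse grid inside it hits the triangle.
    hit = set()
    for tx, ty in bbox_tiles(tri):
        for sx in range(tx * TILE, (tx + 1) * TILE, step):
            for sy in range(ty * TILE, (ty + 1) * TILE, step):
                if _inside((sx, sy), tri):
                    hit.add((tx, ty))
                    break
            else:
                continue
            break
    return hit

thin = ((0, 0), (255, 245), (255, 255))  # thin sliver across the screen
print(len(bbox_tiles(thin)), len(sampled_tiles(thin)))  # 64 vs. far fewer
```

<p>The bounding box bins the sliver into all 64 tiles, while only the handful of tiles along the diagonal actually contain any of it; every extra tile in the first set is wasted rasterisation and shading work.<\/p>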
So when you rasterise, you\u2019re going to generate work for your shading core where nothing actually contributes to the screen.<\/p>\n\n\n\n<p>Instead, we have an algorithm baked into the hardware which we call&nbsp;<em><strong>perfect tiling<\/strong><\/em>. It works as you\u2019d expect: we only generate tile lists where the geometry actually covers some area in the tile. It\u2019s one of the most optimised and most efficient parts of the design. The perfect tiling engine generates that perfect list of tiles for a given set of geometry.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-l1ZrrIfa\/w:auto\/h:auto\/q:90\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2015\/03\/PowerVR-TBDR-perfect-tiling.png\"><img decoding=\"async\" src=\"https:\/\/mlhr8q6s8c91.i.optimole.com\/STwO8dY-l1ZrrIfa\/w:700\/h:205\/q:90\/dpr:1.1\/https:\/\/www.imgtec.com\/wp-content\/uploads\/2015\/03\/PowerVR-TBDR-perfect-tiling.png\" alt=\"PowerVR TBDR - perfect tiling\" class=\"wp-image-7599\"\/><\/a><\/figure><\/div>\n\n\n\n<p>PowerVR perfect tiling vs. bounding box or hierarchical tiling<\/p>\n\n\n\n<p>That tile information plus the primitive lists are packed into the PB as efficiently as we can, and that\u2019s conceptually pretty much it. In reality there\u2019s a heck of a lot that still happens here in the hardware at the back-end phase of tiling, to fill the PB and organise and marshal the actual memory accesses for the external memory writes, but in terms of functionality to wrap your head around, we\u2019re pretty much done.<\/p>\n\n\n\n<p>That front-end hardware architecture for us is where a really big chunk of the efficiency gains can be found in a modern PowerVR GPU, compared to some of our competition. 
Surprisingly to those who find out, it\u2019s part of the hardware architecture that\u2019s not actually that different, at least at the top-level, between Rogue and SGX. While we completely redesigned the shader core for Rogue, the front-end architecture actually bears a strong resemblance to the one you\u2019ll find in&nbsp;<a href=\"https:\/\/www.imgtec.com\/blog\/understanding-powervr-sgx-mobiles-leading-gpu\" target=\"_blank\" rel=\"noreferrer noopener\">the later generation SGX GPU IPs<\/a>. It works and it works very well.<\/p>\n\n\n\n<p>And now that I\u2019m done explaining the tiling part of our TBDR, it\u2019s a good excuse to stop! I\u2019ll come back to the deferred rendering part in a future blog post, so stay tuned.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.imgtec.com\/blog\/a-look-at-the-powervr-graph [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/posts\/2215"}],"collection":[{"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2215"}],"version-history":[{"count":1,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/posts\/2215\/revisions"}],"predecessor-version":[{"id":2221,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=\/wp\/v2\/posts\/2215\/revisions\/2221"}],"wp:attachment":[{"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2215"}],"wp:term":[{"taxonomy":"category","em
beddable":true,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.coolcoding.cn\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}