Monday, September 27, 2010

Software occlusion culling


Today's CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from the gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn't a new idea (see hierarchical occlusion maps, HOM). Basically, it means filling a software z-buffer with occluders and testing objects against it (usually their screen-space bounding boxes). Rasterization is usually done at a low resolution (DICE uses 256x114). Testing can also be done using a hierarchical z-buffer (a min depth or min/max depth hierarchy).
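To make the "test objects against the z-buffer" step concrete, here is a minimal sketch of a bounding-rectangle test. The names `DepthBuffer`, `ScreenRect` and `isRectVisible` are made up for illustration, and the sketch assumes that larger stored values are farther away; flip the comparison if your buffer uses the opposite convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical low-resolution depth buffer; larger values are farther away.
struct DepthBuffer {
    int width;
    int height;
    std::vector<float> depth; // row-major, width * height entries

    DepthBuffer(int w, int h, float clearDepth)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * h, clearDepth) {}
};

// Screen-space bounding rectangle of an object plus its nearest depth.
struct ScreenRect {
    int minX, minY, maxX, maxY; // inclusive pixel bounds
    float minDepth;             // depth of the closest point of the object
};

// Returns true if at least one covered pixel could still show the object.
bool isRectVisible(const DepthBuffer& zb, ScreenRect r) {
    // Clamp to the buffer; a rectangle entirely off screen is not visible.
    r.minX = std::max(r.minX, 0);
    r.minY = std::max(r.minY, 0);
    r.maxX = std::min(r.maxX, zb.width - 1);
    r.maxY = std::min(r.maxY, zb.height - 1);
    if (r.minX > r.maxX || r.minY > r.maxY) return false;

    for (int y = r.minY; y <= r.maxY; ++y) {
        const float* row = &zb.depth[static_cast<std::size_t>(y) * zb.width];
        for (int x = r.minX; x <= r.maxX; ++x) {
            // Early out: the object is at least as close as the occluders here.
            if (r.minDepth <= row[x]) return true;
        }
    }
    return false; // every covered pixel holds a closer occluder
}
```

With a cleared buffer there is no early rejection possible and every object passes on its first pixel; the expensive case is a large rectangle that turns out to be fully occluded, because every covered pixel has to be read.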

How to write one? Step one: the transformation pipeline. It can be a bottleneck if it isn't done properly. Step two: the clipper. Clipper code quality isn't so important. Just remember to clamp or clip the x and y coordinates after the projection divide. Step three: a scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and they play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on SPU with many threads and interpolants. This time I prototyped software occlusion culling for a "min-spec" PC (1-2 core CPU), so there is only one thread, one interpolant, and the resolution is quite small. In this case scanlines were about 2-3 times faster than half-spaces.
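For readers who haven't seen the half-space approach, here is a minimal scalar sketch of it. It is not the code that was benchmarked: `Point2`, `edgeFunction` and `rasterizeHalfSpace` are illustrative names, the winding convention is an assumption, pixels are sampled at integer coordinates, and there is no fill rule, so shared edges get drawn twice.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Point2 { int x, y; };

// Signed area term: positive on one side of the directed edge a->b,
// negative on the other, zero exactly on the edge.
static std::int32_t edgeFunction(Point2 a, Point2 b, Point2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Marks covered pixels in `coverage` (width * height bytes) and returns
// how many were marked. Triangles with non-positive area are skipped.
int rasterizeHalfSpace(Point2 v0, Point2 v1, Point2 v2,
                       int width, int height,
                       std::vector<std::uint8_t>& coverage) {
    if (edgeFunction(v0, v1, v2) <= 0) return 0; // back-facing or degenerate

    // Bounding box of the triangle, clamped to the buffer.
    const int minX = std::max(std::min({v0.x, v1.x, v2.x}), 0);
    const int minY = std::max(std::min({v0.y, v1.y, v2.y}), 0);
    const int maxX = std::min(std::max({v0.x, v1.x, v2.x}), width - 1);
    const int maxY = std::min(std::max({v0.y, v1.y, v2.y}), height - 1);

    // Edge functions are affine, so stepping one pixel is a single add.
    const std::int32_t dx01 = -(v1.y - v0.y), dy01 = v1.x - v0.x;
    const std::int32_t dx12 = -(v2.y - v1.y), dy12 = v2.x - v1.x;
    const std::int32_t dx20 = -(v0.y - v2.y), dy20 = v0.x - v2.x;

    const Point2 start{minX, minY};
    std::int32_t row01 = edgeFunction(v0, v1, start);
    std::int32_t row12 = edgeFunction(v1, v2, start);
    std::int32_t row20 = edgeFunction(v2, v0, start);

    int filled = 0;
    for (int y = minY; y <= maxY; ++y) {
        std::int32_t e01 = row01, e12 = row12, e20 = row20;
        for (int x = minX; x <= maxX; ++x) {
            // The OR has its sign bit set if any edge value is negative.
            if ((e01 | e12 | e20) >= 0) {
                coverage[y * width + x] = 1;
                ++filled;
            }
            e01 += dx01; e12 += dx12; e20 += dx20;
        }
        row01 += dy01; row12 += dy12; row20 += dy20;
    }
    return filled;
}
```

The inner loop visits every pixel of the bounding box, which is the cost a scanline rasterizer avoids by walking only the covered span of each row. That is one plausible reason scanlines come out ahead in a single-threaded, single-interpolant setup.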

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use floats for positions). For depth-only rendering, perspective interpolation is very easy: it's enough to interpolate 1/z' (z' = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z'. There are some differences between a hi-res and a low-res z-buffer. To fix them, the pixel center should be shifted using dzdx and dzdy. In practice it's enough to add some epsilon when testing objects.
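A sketch of what "no division or multiplication in the inner loop" looks like for one scanline span, plus the epsilon-biased test. `fillDepthSpan` and `isDepthVisible` are illustrative names. The sketch assumes the stored value is whatever depth quantity is linear in screen space in your setup, that smaller means closer, and that `depthAtX0` and `dzdx` come from triangle setup.

```cpp
#include <cassert>
#include <vector>

// Fills pixels [x0, x1) of one buffer row with linearly interpolated depth,
// keeping the closer value. Assumes smaller values are closer.
void fillDepthSpan(std::vector<float>& row, int x0, int x1,
                   float depthAtX0, float dzdx) {
    assert(x0 >= 0 && x1 <= static_cast<int>(row.size()));
    float d = depthAtX0;
    for (int x = x0; x < x1; ++x) {
        if (d < row[x]) row[x] = d; // depth test and write
        d += dzdx;                  // one add per pixel, no mul or div
    }
}

// Object test with a small bias towards "visible", to absorb the depth
// differences between the low-res buffer and the real hi-res one.
bool isDepthVisible(float objectMinDepth, float storedDepth, float eps) {
    return objectMinDepth - eps <= storedDepth;
}
```

The right `eps` depends on the depth range and on how steep the occluder surfaces are, so treat it as a tuning parameter rather than a constant.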

Some rasterization performance results. Timings cover rasterization with the full transformation pipeline and clipping, optimized with some SSE intrinsics. The scene is 500 randomly placed quads (each consisting of 2 triangles), with no special optimizations for quads, and all quads are fully visible. 256x128 resolution, 1 thread. Columns are the on-screen quad size in pixels:

CPU                            256x128    61x61      21x21      fillrate       vertex rate
i7 860 (2.8 GHz)               6.56 ms    1.75 ms    0.53 ms    2.50 GPix/s    0.025 GV/s
Core 2 Quad Q8200 (2.33 GHz)   9.20 ms    2.30 ms    0.67 ms    1.76 GPix/s    0.019 GV/s

This shows the true power of the i7: almost 1 pixel filled per cycle :). In a real test case there should be something like 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels), so it looks like 1-2 ms is enough for filling the software z-buffer. It could be optimized for big triangles by adding quick rejection of empty tiles and a fast path for filling fully covered tiles (just like Larrabee does). This dramatically increases performance for large triangles.
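The tile idea can be sketched as a per-tile classification against the three edge equations. This is the general trivial-reject / trivial-accept technique, not Larrabee's actual code. `Edge`, `TileCoverage` and `classifyTile` are illustrative names, and the sketch assumes each edge is given as E(x, y) = a*x + b*y + c with E >= 0 meaning "inside".

```cpp
#include <cstdint>

// One triangle edge as an affine function; a pixel is inside when E >= 0.
struct Edge { std::int32_t a, b, c; };

enum class TileCoverage { Empty, Partial, Full };

// Classifies the tile covering pixels [x0, x1] x [y0, y1] (inclusive).
TileCoverage classifyTile(const Edge (&edges)[3],
                          int x0, int y0, int x1, int y1) {
    bool full = true;
    for (const Edge& e : edges) {
        // The signs of a and b tell which tile corner maximizes E
        // and which one minimizes it.
        const int maxX = e.a >= 0 ? x1 : x0;
        const int maxY = e.b >= 0 ? y1 : y0;
        const int minX = e.a >= 0 ? x0 : x1;
        const int minY = e.b >= 0 ? y0 : y1;

        // Even the best corner is outside this edge: nothing to draw.
        if (e.a * maxX + e.b * maxY + e.c < 0) return TileCoverage::Empty;
        // The worst corner is outside: the edge crosses the tile.
        if (e.a * minX + e.b * minY + e.c < 0) full = false;
    }
    return full ? TileCoverage::Full : TileCoverage::Partial;
}
```

Empty tiles are skipped, full tiles can be filled without any per-pixel edge tests, and only partial tiles need the regular rasterizer.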

Some object testing performance results. Transformation time is not included: it should already be done for frustum culling, and it's quite small (0.33 ms for the i7 and 0.48 ms for the Core 2 Quad). Clipping is included. Optimized with some SSE intrinsics. The scene is 3k randomly placed quads (each fully visible). This is the worst case: no early out (cleared z-buffer). 256x128 resolution, 1 thread. Columns are the on-screen quad size in pixels:

CPU                            120x120    30x30      10x10
i7 860 (2.8 GHz)               2.26 ms    0.07 ms    0.02 ms
Core 2 Quad Q8200 (2.33 GHz)   3.30 ms    0.09 ms    0.03 ms

This also looks reasonably fast, and in a real test case the numbers should be around 1-2 ms. It could be further optimized by using some kind of depth hierarchy (downscaling the z-buffer is very fast, something like 0.05 ms for the full mip-map chain).
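One level of such a depth hierarchy can be built with a 2x2 reduction, sketched below. `downsampleMaxDepth` is an illustrative name. The sketch assumes larger values are farther away, so keeping the maximum of each block is the conservative choice for rejecting objects; with the opposite depth convention you would keep the minimum instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Halves the resolution of a row-major depth buffer, keeping the farthest
// (largest) value of each 2x2 block. Width and height must be even.
std::vector<float> downsampleMaxDepth(const std::vector<float>& src,
                                      int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    const int w = width / 2;
    const int h = height / 2;
    std::vector<float> dst(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const float* r0 = &src[static_cast<std::size_t>(2 * y) * width + 2 * x];
            const float* r1 = r0 + width; // the row below
            dst[static_cast<std::size_t>(y) * w + x] =
                std::max(std::max(r0[0], r0[1]), std::max(r1[0], r1[1]));
        }
    }
    return dst;
}
```

Calling it repeatedly on its own output gives the whole mip chain. An object whose nearest depth is behind the value stored at a coarse level is occluded for the entire block, so most tests can stop before touching the full-resolution buffer.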

Software occlusion culling is quite cool: you can have skinned occluders :). It's easy to write and easy for artists to grasp. There is no precomputation, no frame lag, etc. On x86 with a single thread, software occlusion culling probably won't be faster than beamtrees, but IMHO on consoles it can be faster (no tree data structure traversal), and it's certainly easier to parallelize. Maybe one day I'll try to add it to our engine at work and see how it handles real test cases.

4 comments:

  1. Here is my experience (we're using this at work on SPUs):

    - transform & clip are more or less the same wrt optimization demands - not sure why you separate them
    - you can only clip against the near plane, the rest is done easier in raster stage
    - backface culling is essential
    - scanlines were ~2x faster than halfspace on SPUs for me
    - we test using aabb screen space rect against the min aabb z (view-space)
    - 256x144, no HOM - did not bother as object tests were fast enough

  2. 1. transform & clip. I separate them because only a very small percentage of triangles needs to be clipped.

    2. clipping during rast and triangle setup. IMHO on PC it could be faster to clip with 5 planes. Need to check performance of this one.

    4. z in view space. Does it mean that you store view-space depth in the software depth buffer? Also, why view space and not screen space?

    BTW, did you solve the problem with false positives? I mean the one caused by rendering at a lower resolution. Last time I thought about it, I realized that it's not enough to shift the pixel center when interpolating depth. There can be, for example, 1-pixel-sized occluders which will occlude a bigger pixel block. It's the main issue which prevents us from replacing beamtrees.

  3. Clipping also consists of the first part where you check if the triangle has to be clipped - that has the same perf demands as the transform. The actual clipping can be slower, yes.

    The rasterization clipping introduces per-scanline overhead as opposed to per-triangle, but it's quite tiny.

    Apologies for the confusion, I use post projection depth for the depth buffer.

    Occlusion false positives are almost impossible to avoid, as far as I understand (you'll have to teach the rasterizer to distinguish inner edges vs outer edges), but it was not a problem for us - since we don't (obviously) render the actual level geometry in occlusion buffer, the occlusion geometry can be made slightly smaller than the real volume.

  4. I see. By clipping I mean triangle cutting, not triangle testing.

    Yes, it would require outer edge detection. It's something I'm trying to solve now. Slightly smaller occluders won't work for us, as we have a death penalty for introducing stuff like occlusion false positives (or templates :)).
