Sunday, February 13, 2011

Virtual memory on PC

There is an excellent post about virtual memory. It's written mainly from the perspective of a console developer. On consoles most memory issues come down to TLB misses and the physical memory limit. I'll try to write more about how (bad) it looks on PC (Windows) with 32-bit programs, especially nowadays when games require more and more data.

Firstly, half of a program's virtual address space is taken by the kernel. This means that the top (most significant) bit of a user-mode pointer is unused and can be used for some evil trickery :). Moreover, the first and last 64 KB are reserved by the kernel.

The program's code and heap have to be loaded somewhere. When compiling with VC++ the default image base is 0x00400000. Then a bunch of DLLs are loaded at strange virtual addresses. You can check which DLLs are loaded, at what address and how big they are using Dependency Walker. Use its start profiling feature to see the real virtual address of a given DLL. DLLs and the program usually aren't loaded into one contiguous address range. At this point we haven't called new/malloc even once and virtual memory is already fragmented.

Now comes the video driver. It will use precious virtual memory for managed resources, the command buffer and temporary storage for locking non-managed resources. Creating/locking non-managed resources is especially misleading, as DirectX returns "out of video memory" instead of "out of virtual memory". It's very tempting to put all static level geometry into one 100 MB non-managed vertex buffer. When creating/filling this VB the video driver will try to allocate a contiguous 100 MB chunk of virtual memory. This will likely result in a program crash after some time.

Windows uses 4 KB pages, so requesting smaller allocations directly from the OS leads to internal fragmentation. I guess everyone is already using some kind of custom memory allocator, so it isn't a problem.
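To put a number on the internal fragmentation mentioned above, here is a minimal sketch. The 4 KB page size is taken from the text and hard-coded as an assumption (real code would query it from the OS), and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed page size from the text (4 KB). Hard-coded only for illustration.
const std::size_t kPageSize = 4096;

// Bytes actually consumed when `bytes` are requested in whole pages.
std::size_t roundUpToPage( std::size_t bytes )
{
    assert( bytes > 0 );
    return ( bytes + kPageSize - 1 ) / kPageSize * kPageSize;
}

// Bytes lost to internal fragmentation for a single allocation.
std::size_t wastedBytes( std::size_t bytes )
{
    return roundUpToPage( bytes ) - bytes;
}
```

A 100 byte request rounded up this way wastes 3996 bytes, which is why small allocations are normally served from a custom allocator that carves up bigger blocks.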

There is the /LARGEADDRESSAWARE linker flag, which allows a program to use an additional 1 GB of virtual memory. It requires the user to change boot params and usually doesn't work well in practice (system stability issues etc.). It's also possible to compile as a 64-bit program, but according to the Steam HW survey half of gamers use a 32-bit OS. It's really annoying that MS is still making 32-bit systems, because current min-spec CPUs for PC games are Core 2 or similar with 64-bit support.

To summarize: in theory memory shouldn't be a problem on PC, but in practice it's a precious and fragile resource.

Wednesday, October 27, 2010

Shader optimizations

A small list of basic and sometimes overlooked shader optimization possibilities. These are very small gains, but they can add up and maybe there will be some free time for an additional point light or a better shadow filter?

Full screen quad vs full screen triangle
Post processing effects or color/z downsampling are usually rendered using full screen quads. Hardware works on groups of at least 2x2 pixels. Pixel group size goes up with time (just like expected game resolution). For example NVIDIA Fermi has 4x2 pixel groups and older NVIDIA G80-G92 use 2x2 quads. This means that rendering two fullscreen triangles creates some overlapping quads on the diagonal. At 1000x1000 pixel resolution with 2x2 pixel quads there will be 500 quads shaded twice. Ignoring cache misses, that is 0.2% of additional work (500 of 250,000 quads). Besides, a single fullscreen triangle means one vertex less to push to the GPU :). In my synthetic test (on a crappy GeForce 240) the difference was around 0.2% - 0.3%.
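The 0.2% figure is simple arithmetic and can be reproduced with a few lines. This is only a sketch under the same simplifications as the text: a square render target, square pixel quads and a diagonal that touches exactly one quad per quad row. The function name is hypothetical.

```cpp
#include <cassert>

// Fraction of extra pixel-quad work caused by the shared diagonal when a
// square render target is covered by two triangles instead of one.
double diagonalOverheadFraction( int resolution, int quadSize )
{
    assert( resolution > 0 && quadSize > 0 && resolution % quadSize == 0 );
    const int quadsPerRow = resolution / quadSize;   // also the number of quads on the diagonal
    const double totalQuads = double( quadsPerRow ) * quadsPerRow;
    return quadsPerRow / totalQuads;
}
```

For a 1000x1000 target and 2x2 quads this gives 500 / 250000 = 0.002. Bigger pixel groups or smaller render targets (downsampling passes) make the fraction grow.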

Direct3D shader compiler (FXC) on PC
Instruction counts displayed by FXC on PC don't mean much nowadays. FXC just translates HLSL to asm, which is later translated by the driver to a hardware-specific IL. It's quite possible to decrease the instruction count displayed by FXC and slow down the shader at the same time. Instead of relying on FXC numbers it's better to check real performance (FPS/ms) and/or check the numbers generated by special tools (ShaderAnalyzer and ShaderPerf). They also display the GPR count, which is quite important as it shows how much stuff can run in parallel.

Hardware instructions
  • MADD is a hardware instruction on GPU. Convert code like "( x - a ) * b" to "x * c + d" (with c = b and d = -a * b precomputed). This can save 1 ALU instruction.
  • Saturate, negation and abs are instruction modifiers and are free. Yes, there is a free dinner :). Sometimes equations can be changed to use saturate instead of clamp/min/max. Negation and abs can help to decrease the number of used constant registers.
  • Some instructions are executed on the transcendental units. Transcendental units compute everything as scalars and there is roughly one transcendental unit per 2-8 ALUs. It's a good idea to avoid excessive usage of instructions like sin, cos, log, sqrt, pow (very bad - calculated using three instructions).
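The MADD rewrite from the first bullet is just constant folding done on the CPU side. A minimal sketch, written in C++ rather than HLSL so it can be checked offline. All names are hypothetical and the range remap is only one example of the "( x - a ) * b" pattern.

```cpp
#include <cassert>

struct MaddConstants { float c; float d; };

// ( x - a ) * b  ==  x * b + ( -a * b ), so c = b and d = -a * b.
MaddConstants foldToMadd( float a, float b )
{
    MaddConstants k = { b, -a * b };
    return k;
}

// Hypothetical use: remap x from [lo, hi] to [0, 1], a typical "( x - a ) * b".
MaddConstants remapRangeToMadd( float lo, float hi )
{
    assert( hi != lo );
    return foldToMadd( lo, 1.f / ( hi - lo ) );
}

// What the shader would evaluate: a single multiply-add.
float evalMadd( float x, MaddConstants k )
{
    return x * k.c + k.d;
}
```

The constants would be uploaded as shader constants instead of a and b. The two forms are not bit-identical in floating point, so results can differ in the last bits.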

Vectorize with care
Some GPUs have vector ALU units (AMD/ATI cards and NVIDIA cards older than G80) and some have scalar ones (NVIDIA G80, G92, Fermi). A vector ALU means that a scalar instruction takes the same time as a vector one, which computes 4 components at once. Usually people try to vectorize everything in shaders, which can add some additional computations and actually result in a slower shader on scalar ALU hardware. It's a good idea to mask vector computations. For example in a blur shader there is no need to calculate the alpha channel, so just use float3 for accumulation. We could go further and even write two shader versions: one for vector ALUs and one for scalar ones. There is no point in vectorizing instructions computed on transcendental units (sin, cos, log, pow...) - they are always scalar.

Clip/texkill
Consider adding clip/texkill (or alpha test, which can be faster on old hardware) when alpha blending is enabled. Think deferred lights, particles, volumetric light shafts. This can remove some work from the ROP units if you don't have uber tight geometry.

Interpolators
Shader bottlenecks aren't only about ALU, GPR and texture fetches. On rare occasions (or on some hardware) interpolators can become a bottleneck. Sometimes, when using short pixel shaders, it's better to move computations from the vertex shader to the pixel shader if it helps to decrease the interpolator count.
// 8 interpolators and minimal ALU in pixel shader
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv[ i ];
    }
    return color;
}

// one interpolator and some ALU in pixel shader
float4 gSomeValMul[ 8 ];
float4 gSomeValAdd[ 8 ];
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv * gSomeValMul[ i ] + gSomeValAdd[ i ];
    }
    return color;
}
The 8 interpolator version runs at 28.17 ms (100 runs on a GeForce 240). The 1 interpolator + some ALU version runs at 21.44 ms (the same as an empty pixel shader). This is of course a very specific case. Still, it's a good idea to watch out and pack interpolators.

Monday, September 27, 2010

Software occlusion culling


Today CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn't a new idea (see HOM, hierarchical occlusion maps). Basically it's filling a software z-buffer and testing some objects against it (usually their screen space bounding boxes). Rasterization is usually done at a low resolution (DICE uses 256x114). Testing can also be done using a hierarchical z-buffer (min depth or min/max depth hierarchy).
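The testing part can be sketched as a brute-force loop over a screen space rectangle. This is not the code used for the measurements in this post, just an illustration. The depth convention (smaller value = closer, buffer cleared to a far value) and all names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DepthBuffer
{
    int width;
    int height;
    std::vector<float> depth;   // row-major, smaller value = closer (assumption)

    DepthBuffer( int w, int h, float farValue )
        : width( w ), height( h ),
          depth( static_cast<std::size_t>( w ) * h, farValue ) {}
};

// Returns true if any pixel of the rectangle [x0,x1] x [y0,y1] (inclusive)
// could show the object, i.e. the occluder depth stored there is not closer
// than the nearest depth of the object's bounding box.
bool isRectVisible( const DepthBuffer& zb, int x0, int y0, int x1, int y1,
                    float nearestDepth )
{
    assert( x0 <= x1 && y0 <= y1 );
    // Clamp the rectangle to the buffer.
    x0 = std::max( x0, 0 );
    y0 = std::max( y0, 0 );
    x1 = std::min( x1, zb.width - 1 );
    y1 = std::min( y1, zb.height - 1 );

    for ( int y = y0; y <= y1; ++y )
        for ( int x = x0; x <= x1; ++x )
            if ( nearestDepth <= zb.depth[ y * zb.width + x ] )
                return true;    // early out on the first visible pixel
    return false;               // fully occluded or fully off screen
}
```

With a cleared z-buffer every on-screen rectangle passes on its first pixel. Once occluders are rasterized, only objects whose nearest depth is behind every covered pixel get culled.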

How to write one? Step one - transformation pipeline. It can be a bottleneck if it isn't done properly. Step two - clipper. Clipper code quality isn't so important. Just remember to clamp coordinates or clip x and y coordinates after the projection divide. Step three - scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on SPU with many threads and interpolants. This time I prototyped software occlusion culling for a "min-spec" PC (1-2 core CPU), so there is only 1 thread, one interpolant and the resolution is quite small. In this case scanlines were about 2-3 times faster than half-spaces.
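For reference, the half-space approach boils down to evaluating three edge functions per pixel inside the triangle's bounding box. Below is a deliberately naive sketch, not the prototype measured in this post: integer pixel coordinates, inclusive edges (no top-left fill rule), no depth and no SIMD. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { int x, y; };

// Twice the signed area of (a, b, p). The sign tells on which side of the
// directed edge a -> b the point p lies; zero means p is exactly on the edge.
int edgeFunction( Point a, Point b, Point p )
{
    return ( b.x - a.x ) * ( p.y - a.y ) - ( b.y - a.y ) * ( p.x - a.x );
}

// Marks every covered pixel with 1 and returns how many pixels were covered.
int rasterizeHalfSpace( Point a, Point b, Point c, int width, int height,
                        std::vector<unsigned char>& coverage )
{
    assert( coverage.size() == static_cast<std::size_t>( width ) * height );

    const int area = edgeFunction( a, b, c );
    if ( area == 0 )
        return 0;               // degenerate triangle
    if ( area < 0 )
        std::swap( b, c );      // make all three edge functions non-negative inside

    // Bounding box of the triangle, clamped to the buffer.
    const int minX = std::max( std::min( a.x, std::min( b.x, c.x ) ), 0 );
    const int maxX = std::min( std::max( a.x, std::max( b.x, c.x ) ), width - 1 );
    const int minY = std::max( std::min( a.y, std::min( b.y, c.y ) ), 0 );
    const int maxY = std::min( std::max( a.y, std::max( b.y, c.y ) ), height - 1 );

    int covered = 0;
    for ( int y = minY; y <= maxY; ++y )
    {
        for ( int x = minX; x <= maxX; ++x )
        {
            const Point p = { x, y };
            if ( edgeFunction( a, b, p ) >= 0 &&
                 edgeFunction( b, c, p ) >= 0 &&
                 edgeFunction( c, a, p ) >= 0 )
            {
                coverage[ y * width + x ] = 1;
                ++covered;
            }
        }
    }
    return covered;
}
```

A real implementation would step the edge functions incrementally (they are linear in x and y, so each step is one addition) and test several pixels at once with SSE, which is where the nice mapping to vector instructions comes from.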

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use floats for positions). For depth-only rendering perspective interpolation is very easy - it's enough to interpolate 1/z' (z' = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z'. There are some differences between a hi-res and a low-res z-buffer. To fix this, the pixel center should be shifted using dzdx and dzdy. In practice it's enough to add some epsilon when testing objects.
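The "no division or multiplication in the inner loop" part looks roughly like this for one scanline span. It's a sketch, not the measured code. It assumes the stored depth value is linear in screen space and that a smaller value means closer; with a reciprocal stored as described above the comparison would be flipped. Function and parameter names are made up.

```cpp
#include <cassert>

// Fills pixels [x0, x1) of one z-buffer row. The depth value is stepped
// incrementally, so the inner loop is one compare, one optional store and
// one addition per pixel.
void fillDepthSpan( float* row, int x0, int x1, float depthAtX0, float dDepthDx )
{
    assert( row != 0 && x0 <= x1 );
    float d = depthAtX0;
    for ( int x = x0; x < x1; ++x )
    {
        if ( d < row[ x ] )     // keep the closest occluder (smaller = closer, assumption)
            row[ x ] = d;
        d += dDepthDx;
    }
}
```

The per-triangle setup (span start depth and dDepthDx) is where the divisions live, and it runs once per triangle or scanline instead of once per pixel.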

Some rasterization performance results. Rasterization with full transformation pipeline and clipping. Optimized with some SSE intrinsics. Randomly placed 500 quads (each consists of 2 triangles). No special optimizations for quads and all are fully visible. 256x128 resolution and 1 thread. CPU / quad pixel screen size:

CPU                            256x128   61x61     21x21     fillrate      vertex rate
i7 860 (2.8 GHz)               6.56 ms   1.75 ms   0.53 ms   2.50 GPix/s   0.025 GV/s
Core 2 Quad Q8200 (2.33 GHz)   9.20 ms   2.30 ms   0.67 ms   1.76 GPix/s   0.019 GV/s

This shows the true power of the i7 - almost 1 pixel filled per cycle :). In a real test case there should be around 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels), so it looks like 1-2 ms is enough for filling the software z-buffer. It could be optimized for big triangles by writing code for quick rejection of empty tiles and for filling fully covered tiles (just like Larrabee does). This dramatically increases performance for large triangles.

Some object testing performance results. Transformation time not included - should be already done for frustum culling and it's quite small (0.33ms for i7 and 0.48 for core2 quad). Clipping. Optimized with some SSE intrinsics. Randomly placed 3k quads (each fully visible). Worst case - no early out (cleared z-buffer). 256x128 resolution. 1 thread. CPU / quad pixel screen size:

CPU                            120x120   30x30     10x10
i7 860 (2.8 GHz)               2.26 ms   0.07 ms   0.02 ms
Core 2 Quad Q8200 (2.33 GHz)   3.30 ms   0.09 ms   0.03 ms

Also looks reasonably fast and in real test case numbers should be around 1-2ms. It could be further optimized by using some kind of depth hierarchy (downscaling z-buffer is very fast - something like 0.05ms for full mip-map chain).
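One level of such a depth hierarchy could be built as below. This is a sketch and it assumes smaller depth = closer, so the conservative value for a 2x2 block is the farthest one (max). With reciprocal depth stored it would be the min instead. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the next coarser mip level of a row-major depth buffer. Every texel
// keeps the farthest depth of its 2x2 block, so an object that is behind the
// coarse value is guaranteed to be behind all four fine values.
std::vector<float> downscaleFarthest( const std::vector<float>& src, int width, int height )
{
    assert( width % 2 == 0 && height % 2 == 0 );
    assert( src.size() == static_cast<std::size_t>( width ) * height );

    const int dstWidth = width / 2;
    const int dstHeight = height / 2;
    std::vector<float> dst( static_cast<std::size_t>( dstWidth ) * dstHeight );

    for ( int y = 0; y < dstHeight; ++y )
    {
        for ( int x = 0; x < dstWidth; ++x )
        {
            const float* row0 = &src[ ( 2 * y ) * width + 2 * x ];
            const float* row1 = row0 + width;
            dst[ y * dstWidth + x ] = std::max( std::max( row0[ 0 ], row0[ 1 ] ),
                                                std::max( row1[ 0 ], row1[ 1 ] ) );
        }
    }
    return dst;
}
```

Calling it repeatedly until the buffer is 1x1 gives the full mip-map chain. A test then picks the level where the object's rectangle covers only a few texels.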

Software occlusion culling is quite cool - you can have skinned occluders :). It's easy to write and easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 with a single thread software occlusion culling probably won't be faster than beamtrees, but IMHO on consoles it can be faster (no tree data structure traversal) and for sure it's easier to parallelize. Maybe one day I'll try to add it to our engine at work and see how it handles real test cases.

Saturday, August 14, 2010

Aggregated deferred lighting

Random idea about a new way to do deferred lighting. The idea is to decouple lighting from geometry normals. In order to do that, lighting information is stored as aggregated lights ( direction + color ).

1st pass - z-prepass ( just render depth )
2nd pass - render lighting geometry / quads / tiles.... Output aggregated virtual directional lights for every pixel. This means weighted average of light directions and weighted sum of light colors for every pixel.
3rd pass - render geometry and shade using buffer with aggregated directional lights (and maybe add standard forward directional light)

2nd pass render target layout:
RT0: aggregated light color RGB
RT1: aggregated light direction XYZ

We want to achieve this:
AggregatedLightColor = 0.
AggregatedLightDir   = 0.

for every light
    AggregatedLightColor += LightColor * LightAttenuation
    AggregatedLightDir   += LightDir * intensity(LightColor * LightAttenuation)

In order to do this, we need:
1. Init RT0 and RT1 with 0x00000000
2. Setup additive blending states
3. Output from light pixel shader:
ColorRT0 = LightColor * LightAttenuation
ColorRT1 = LightDirection * dot( ColorRT0, ToGrayscaleVec )
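What additive blending accumulates for a single pixel can be mimicked on the CPU. This is a sketch only: the luminance weights used for ToGrayscaleVec are an assumption (common Rec. 601 values), the Light struct is hypothetical, and the direction sum is left unnormalized on the assumption that the 3rd pass normalizes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 operator+( Vec3 a, Vec3 b ) { Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z }; return r; }
Vec3 operator*( Vec3 a, float s ) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }
float dot( Vec3 a, Vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One light as seen from a single pixel (attenuation already evaluated).
struct Light { Vec3 color; Vec3 direction; float attenuation; };

// Contents of RT0 (color) and RT1 (direction) for that pixel.
struct AggregatedLight { Vec3 color; Vec3 direction; };

AggregatedLight aggregateLights( const std::vector<Light>& lights )
{
    const Vec3 toGrayscale = { 0.299f, 0.587f, 0.114f };    // assumed luminance weights
    AggregatedLight result = { { 0.f, 0.f, 0.f }, { 0.f, 0.f, 0.f } };  // RTs cleared to 0

    for ( std::size_t i = 0; i < lights.size(); ++i )
    {
        assert( lights[ i ].attenuation >= 0.f );
        // ColorRT0 = LightColor * LightAttenuation
        const Vec3 contribution = lights[ i ].color * lights[ i ].attenuation;
        // ColorRT1 = LightDirection * dot( ColorRT0, ToGrayscaleVec )
        const Vec3 weightedDir = lights[ i ].direction * dot( contribution, toGrayscale );
        // Additive blending.
        result.color = result.color + contribution;
        result.direction = result.direction + weightedDir;
    }
    return result;
}
```

Feeding it two lights with opposing directions and similar intensity shows the weak spot listed in the cons below: the summed direction nearly cancels out.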

Cons?
  • Light aggregation as virtual directional lights per pixel is an approximation. Moreover we can't properly blend normals by using their arithmetic averages. It means that with many lights per pixel (with opposing directions) it won't be too accurate (but it shouldn't be too visible).


Benefits?
  • Flexibility. You can use almost any lighting model
  • You can render lighting in lower resolution, as high frequency normal map details are added later. There will be artifacts at depth discontinuities, but maybe for some types of content (think desaturated and gray like Gears of War or Killzone 2 :)) they won't be too visible
  • Less bandwidth and memory usage (if we compare it to deferred lighting and shading, which stores full specular color, not just its intensity).
  • Z prepass is faster than rendering GBuffer or normals + exponent
  • A bit simpler calculations. No need for encoding / decoding material properties (normal, exponent,...).

Now it's time to find some free time and code a demo in order to compare it to deferred lighting/shading in real application :).

P.S. Decoupling can also be done by storing lighting as spherical harmonics or cubemaps: link1 link2 link3 ( thanks Hogdman from gd.net forums ). The downside of that method is the lack of proper specular, because the lighting data is low frequency, and it will be slower.

P.S. 2 It looks like it would be better to store normals as angles (RT1.xy - weighted 2 angles, RT1.z - sum of weights). It would ensure proper aggregated light direction interpolation.




UPDATE: I prototyped this method and it doesn't work too well :). Comparison screenshot with a hard case for the idea - two point lights with very different colors influencing the same area. Left - normal lighting, right - aggregated to direction and color:


Friday, August 13, 2010

Rendering light geometry in deferred shading/lighting

An interesting idea from Call of Juarez 2 about rendering deferred light geometry. When deferred light geometry intersects the camera you need to switch culling and turn off the z-buffer. In COJ2, instead of testing for intersection on the CPU and switching states, they just push the light geometry vertices out in the vertex shader:

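// vertex shader: keep light geometry vertices in front of the near plane
// ( In / Out instead of in / out, which are reserved words in HLSL )
float3 posCS = mul( In.pos, worldToCamera ).xyz;
posCS.z = max( posCS.z, nearPlaneZ + offset );
Out.pos = mul( float4( posCS, 1. ), cameraToScreen );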

Could be a win if you are CPU bound.

Wednesday, July 28, 2010

Siggraph 2010 papers

A small list of Siggraph 2010 papers (I'll try to keep it up to date):


Wednesday, July 7, 2010

VC++ and multiple inheritance

Today at work we were optimizing memory usage. At some point we found out that the size (on stack) of our basic data structures is x bytes bigger than the summed size of their members. Every basic data structure was written following Alexandrescu's policy-based design - using inheritance from some templated empty classes. Let's see a simple example:

#include <stdio.h>

class A { };
class B { };
class C : public A, B 
{ 
    int test; 
};

int main()
{
    printf( "%d\n", (int)sizeof( C ) ); // sizeof yields size_t, so cast for %d
    return 0;
}

Assume the compiler uses 4 byte alignment. Will this program print 4? That depends. Compiled by GCC it will print 4, but compiled by VC++ (2005-2010) it will print 8.

Every class in C++ has to be at least 1 byte in size in order to have a valid memory address. With multiple inheritance sizeof(C) = sizeof(A) + sizeof(B) + some alignment. So VC++ behavior is correct (the standard allows, but does not require, empty base classes to take no space), but not optimal. It's strange that it was reported to MS in 2005 and they still haven't fixed it.
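One possible way around it, shown here only as a sketch and not as what we ended up doing at work, is to turn the multiple inheritance into a single chain so that there is only one empty base at each level. BChained and Chained are hypothetical names. As far as I know the empty base optimization is applied to such single chains by the compilers discussed above.

```cpp
#include <cassert>

class A { };
class B { };

// Original layout: two empty bases side by side.
class Multi : public A, B
{
    int test;
};

// Hypothetical rewrite of policy B as a template deriving from the previous
// policy, so the final class has a single chain of empty bases.
template <class Base>
class BChained : public Base { };

class Chained : public BChained<A>
{
    int test;
};

void checkSizes()
{
    // The chained version should not pay for its empty bases.
    assert( sizeof( Chained ) == sizeof( int ) );
    // The multiple inheritance version is 4 or 8 bytes depending on the compiler.
    assert( sizeof( Multi ) >= sizeof( int ) );
}
```

The price is that every policy class has to be written as a template taking its base, which makes the policy list a bit uglier to read.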