Monday, March 25, 2013

LA Noire

LA Noire has some amazing tech for face animation. Basically, actors are filmed by multiple cameras and the resulting footage is converted to keyframed animation and animated textures. All the textures are captured in neutral lighting conditions, so the lighting usually doesn't fit the in-game environment. It looks like those textures are animated at around 3 frames per second. Eyes are animated separately and at a higher rate. This approach also has some interesting "artifacts", as it's impossible to capture everything in one day. For example, you can see the hair shift and change when blending between two performances captured on different days:


More info:
LA Noire face tech animation trailer
LA Noire tech description by IQGamer
MotionScan website

Saturday, October 6, 2012

Unreal Engine 4 gaussian specular normalization

Recently Epic gave a nice presentation about their new tech: "The technology behind Unreal 4 Elemental demo". Among a lot of impressive stuff, they showed their gaussian specular approximation. Here is a BRDF with the U4 specular for Disney's BRDF Explorer:

analytic

::begin parameters
float n 1 512 100
bool normalized 1
::end parameters

::begin shader

vec3 BRDF( vec3 L, vec3 V, vec3 N, vec3 X, vec3 Y )
{
    vec3 H = normalize( L + V );
    float Dot = clamp( dot( N, H ), 0.0, 1.0 );

    // the lobe falls to exp( -1 ) at the N.H value where Blinn-Phong pow( N.H, n ) equals Threshold
    float Threshold = 0.04;
    float CosAngle = pow( Threshold, 1.0 / n );
    float NormAngle = ( Dot - 1.0 ) / ( CosAngle - 1.0 );
    float D = exp( -NormAngle * NormAngle );

    if ( normalized )
    {
        // fitted normalization factor, derived below
        D *= 0.17287429 + 0.01388682 * n;
    }

    return vec3( D );
}

::end shader

This approximation was tweaked to have less aliasing than the standard Blinn-Phong specular (it has a smoother falloff):

 
 

The presentation doesn't include a normalization factor for it. It was a nice excuse to spend some time with Mathematica and try to derive it myself.

The basic idea of a normalization factor is that lighting needs to be energy conserving (outgoing energy can't be greater than incoming energy). This means that the integral of the BRDF times cos(theta) over the upper hemisphere can't exceed 1, or more specifically in our case, we want it to be equal to 1:


The highest values will be when the light direction equals the normal (L=N). This means that we can replace dot(N,H) with cos(theta/2), as now the angle between H (halfway vector) and N equals half of the angle between V and N. This greatly simplifies the integral. Now we can replace f(l,v) with the U4 gaussian approximation:
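Written out under the assumptions in the text (L = N, theta measured from the normal, t = 0.04 taken from the Threshold in the shader above, and k as my own name for the normalization factor), the condition and its substituted form would presumably read:

```latex
\int_{\Omega} f(l, v)\,\cos\theta \, d\omega = 1
\quad\Longrightarrow\quad
k \cdot 2\pi \int_{0}^{\pi/2}
\exp\!\left( -\left( \frac{\cos(\theta/2) - 1}{t^{1/n} - 1} \right)^{2} \right)
\cos\theta \, \sin\theta \, d\theta = 1,
\qquad t = 0.04
```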


Unfortunately neither I nor Mathematica could solve it analytically, so I had to calculate the values numerically and try to fit various simple functions over the range [1; 512]. The best approximation I could find was 0.17287429 + 0.01388682 * n, where n is the Blinn-Phong specular power.
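The numeric step can be reproduced outside Mathematica. Below is a minimal C++ sketch, assuming L = N (so dot(N,H) = cos(theta/2)) and a plain midpoint rule; the function names are made up for this example and the integration scheme is not necessarily what was used originally.

```cpp
#include <cassert>
#include <cmath>

// Unnormalized gaussian lobe from the shader above; cosHalf stands for dot(N, H).
double GaussianLobe(double cosHalf, double n)
{
    const double threshold = 0.04;
    const double cosAngle  = std::pow(threshold, 1.0 / n);
    const double normAngle = (cosHalf - 1.0) / (cosAngle - 1.0);
    return std::exp(-normAngle * normAngle);
}

// Assumption: L = N, so the hemisphere integral reduces to
// 2*pi * integral over [0, pi/2] of D(cos(theta/2)) * cos(theta) * sin(theta).
// Evaluated with the midpoint rule; returns 1 / integral.
double NumericNormalization(double n, int steps = 100000)
{
    const double pi     = 3.14159265358979323846;
    const double dTheta = (pi / 2.0) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i)
    {
        const double theta = (i + 0.5) * dTheta;
        sum += GaussianLobe(std::cos(theta * 0.5), n) * std::cos(theta) * std::sin(theta);
    }
    return 1.0 / (2.0 * pi * sum * dTheta);
}

// Linear fit quoted in the post.
double FittedNormalization(double n)
{
    return 0.17287429 + 0.01388682 * n;
}
```

Comparing NumericNormalization(n) with FittedNormalization(n) over a few specular powers is a quick way to see where the linear fit holds up and where it drifts.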



As you can see, it isn't accurate for small specular power values, but on the other hand it's very fast, and specular powers below 16 aren't used often.

Monday, June 4, 2012

Visual C++ linker timestamp

A nice trick to get the build's timestamp at runtime (Visual C++ only):

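#include <windows.h>

// Pseudo-variable provided by the Microsoft linker; its address is the module's base address.
EXTERN_C IMAGE_DOS_HEADER __ImageBase;

(...)

// e_lfanew is the offset from the DOS header to the NT headers.
IMAGE_NT_HEADERS const* ntHeader
    = (IMAGE_NT_HEADERS const*) ( (char const*) &__ImageBase + __ImageBase.e_lfanew );
// Link time in seconds since 1970-01-01 00:00:00 UTC.
DWORD const timeStamp = ntHeader->FileHeader.TimeDateStamp;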

It's not very portable, but it doesn't require any additional recompilation (as the __DATE__ and __TIME__ macros do).
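The raw value is just a count of seconds since 1970-01-01 00:00:00 UTC, so it still needs formatting before it can be shown in a log or an about box. A minimal portable sketch follows; FormatLinkerTimestamp is a hypothetical helper name, not part of any Windows API.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>
#include <string>

// Converts a PE TimeDateStamp (seconds since 1970-01-01 00:00:00 UTC)
// into a "YYYY-MM-DD HH:MM:SS" string in UTC. Returns an empty string on failure.
std::string FormatLinkerTimestamp(std::uint32_t timeStamp)
{
    const std::time_t seconds = static_cast<std::time_t>(timeStamp);
    const std::tm* utc = std::gmtime(&seconds);
    if (utc == nullptr)
    {
        return std::string();
    }
    char buffer[32];
    const std::size_t len = std::strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S", utc);
    return std::string(buffer, len);
}
```

With the snippet above this would be called as FormatLinkerTimestamp( timeStamp ).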

Sunday, May 6, 2012

PixelJunk Eden data extractor

Recently Q-Games ported one of their games, PixelJunk Eden, to Windows. It's actually their first Windows release. From a tech perspective it's not as impressive as the PixelJunk Shooter series, but it was still quite interesting to poke around this game's files. Unfortunately the game data is stored in a custom format and encrypted. I wonder why people waste time encrypting game data.

I had to reverse engineer the data files and write a simple extractor program. You can download the sources here. The rest of the post contains information about the encryption and file formats.

Files are encrypted using a homemade xor encryption scheme:
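// The key stream depends only on the file size, so running the same loop again decrypts.
unsigned seed = fileSize + 0x006FD37D;
for ( unsigned i = 0; i < fileSize; ++i )
{
    // unsigned arithmetic wraps modulo 2^32; fileByteArr holds bytes, so only the low byte of xorKey matters
    unsigned const xorKey = seed * ( seed * seed * 0x73 - 0x1B ) + 0x0D;
    fileByteArr[ i ] ^= (unsigned char) xorKey;
    ++seed;
}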
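
Since xor is its own inverse, the same routine both encrypts and decrypts. Here is a self-contained sketch of the loop above wrapped in a function; XorCrypt is a name invented for this example, and it assumes the whole file sits in one byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Applies the xor scheme in place. Calling it twice restores the original data.
// Unsigned arithmetic is used so the multiplications wrap modulo 2^32.
void XorCrypt(std::vector<std::uint8_t>& data)
{
    std::uint32_t seed = static_cast<std::uint32_t>(data.size()) + 0x006FD37Du;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        const std::uint32_t xorKey = seed * (seed * seed * 0x73u - 0x1Bu) + 0x0Du;
        data[i] ^= static_cast<std::uint8_t>(xorKey); // only the low byte is applied
        ++seed;
    }
}
```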

Game data is stored in lump_x_x.pak files, with a description in the lump.idx file. Lump.idx consists of one header, multiple lump_x_x.pak descriptors and multiple packed file descriptors. All files (*.idx and *.pak) are encrypted using the xor scheme mentioned above.

lump.idx header:
struct IndexHeader
{
    char     m_magic[ 4 ];      // "PACK"
    unsigned m_unknown0;
    unsigned m_unknown1;
    unsigned m_packedFileMaxID; // packed file num - 1
    unsigned m_lumpFileMaxSize;
    unsigned m_lumpFileNum;
    char     m_align[ 230 ];
};

lump.idx file descriptors of lump_x_x.pak files:
struct IndexLumpDesc
{
    unsigned      m_unknown;
    unsigned      m_lumpSize;
    unsigned char m_lumpPartID;
    unsigned char m_lumpID;
    char          m_align[ 2 ];
};

lump.idx file descriptors of packed files:
struct IndexFileDesc
{
    char     m_filename[ 120 ];
    unsigned m_offset;
    unsigned m_size;
};

lump_x_x.pak files contain packed files at the specified offsets. Every packed file is stored at an offset aligned to 128 bytes.
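The 128-byte alignment gives a cheap sanity check for decrypted descriptors. A small sketch, assuming 32-bit unsigned fields as in the structs above; PackedFileDesc, AlignUp128 and IsAligned128 are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors IndexFileDesc above with fixed-width fields.
struct PackedFileDesc
{
    char          m_filename[120];
    std::uint32_t m_offset;
    std::uint32_t m_size;
};
static_assert(sizeof(PackedFileDesc) == 128, "descriptor is expected to be 128 bytes");

// Rounds an offset up to the next 128-byte boundary.
std::uint32_t AlignUp128(std::uint32_t offset)
{
    return (offset + 127u) & ~127u;
}

// A correctly decrypted descriptor should point at a 128-byte aligned offset.
bool IsAligned128(std::uint32_t offset)
{
    return (offset & 127u) == 0u;
}
```

If IsAligned128 fails for most descriptors, the decryption or the assumed struct layout is probably off.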

Sunday, February 13, 2011

Virtual memory on PC

There is an excellent post about virtual memory. It's written mainly from the perspective of a console developer. On consoles most memory issues are TLB misses and the physical memory limit. I'll try to write more about how (bad) it looks on PC (Windows) with 32-bit programs, especially nowadays when games require more and more data.

Firstly, half of the program's virtual address space is taken by the kernel. This means that the top bit of a pointer is unused and can be used for some evil trickery :). Moreover, the first and last 64 KB are reserved by the kernel.

The program's code and heap have to be loaded somewhere. When compiling with VC++ the default base address is 0x00400000. Then a bunch of DLLs are loaded at strange virtual memory addresses. You can check which DLLs are loaded, at what address, and see their size using Dependency Walker. Use the start profiling feature to see the real virtual memory address of a given DLL. DLLs and the program usually aren't loaded into one contiguous address range. At this point we haven't called new/malloc even once and virtual memory is already fragmented.

Then comes the video driver. It will use precious virtual memory for managed resources, the command buffer and temporary memory for locking non-managed resources. Creating/locking non-managed resources is especially misleading, as DirectX returns "out of video memory" instead of "out of virtual memory". It's very tempting to put all static level geometry into one 100 MB non-managed vertex buffer. When creating/filling this VB the video driver will try to allocate a contiguous 100 MB chunk of virtual memory. This will likely result in a program crash after some time.

Windows uses 4 KB pages, so doing smaller allocations will lead to internal fragmentation. I guess everyone is already using some kind of custom memory allocator, so it isn't a problem.
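To put a number on that internal fragmentation, here is a tiny sketch that rounds a request up to whole pages. It assumes the 4 KB page size mentioned above and allocations made directly at page granularity; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kPageSize = 4096; // 4 KB pages, as stated above

// Bytes actually consumed when a request is rounded up to whole pages.
std::size_t PageRoundedBytes(std::size_t requested)
{
    return (requested + kPageSize - 1) / kPageSize * kPageSize;
}

// Bytes lost to internal fragmentation for a single page-granular allocation.
std::size_t WastedBytes(std::size_t requested)
{
    return PageRoundedBytes(requested) - requested;
}
```

A 100 byte request made this way burns a full page, which is why small allocations normally go through a pooling allocator instead.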

There is the /LARGEADDRESSAWARE linker flag, which allows the use of an additional 1 GB of virtual memory. It requires the user to change boot params and usually doesn't work well in practice (system stability issues etc.). It's also possible to compile as a 64-bit program, but according to the Steam HW survey half of gamers use a 32-bit OS. It's really annoying that MS is still making 32-bit systems, because current min-spec PC game CPUs are Core 2 or similar with 64-bit support.

To summarize: in theory memory shouldn't be a problem on PC, but in practice it's a precious and fragile resource.

Wednesday, October 27, 2010

Shader optimizations

A small list of basic and sometimes overlooked shader optimization possibilities. These are very small gains, but they can add up, and maybe there will be some free time for an additional point light or a better shadow filter?

Full screen quad vs full screen triangle
Post processing effects or color/z downsampling are usually rendered using full screen quads. Hardware works on groups of at least 2x2 pixels. Pixel group size goes up with time (just like expected game resolution). For example, NVIDIA Fermi has 4x2 pixel groups and the older NVIDIA G80-G92 use 2x2 quads. This means that rendering two fullscreen triangles creates some overlapping quads on the diagonal. At 1000x1000 resolution with 2x2 pixel quads, there will be 500 quads shaded twice. If we factor out cache misses, that is 0.2% of additional work. Besides, with a single fullscreen triangle there is one vertex less to push to the GPU :). In my synthetic test (on a crappy GeForce 240) the difference was around 0.2% - 0.3%.
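Both the triangle setup and the overhead estimate are easy to write down. A sketch in C++ rather than shader code: the clip-space corners (-1,-1), (3,-1), (-1,3) are the usual way to cover the screen with one triangle, and DiagonalOverhead only handles the square case used in the example above. All names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Clip-space position of vertex i (0..2) of a single triangle covering the whole screen.
// The parts outside [-1, 1] are clipped away by the hardware.
Vec2 FullScreenTriangleVertex(int i)
{
    Vec2 v;
    v.x = (i == 1) ? 3.0f : -1.0f;
    v.y = (i == 2) ? 3.0f : -1.0f;
    return v;
}

// Fraction of pixel quads shaded twice along the diagonal of a two-triangle
// fullscreen quad, for a square size x size target and quadSize x quadSize pixel groups.
double DiagonalOverhead(int size, int quadSize)
{
    const double quadsPerSide = static_cast<double>(size) / quadSize;
    return quadsPerSide / (quadsPerSide * quadsPerSide);
}
```

DiagonalOverhead( 1000, 2 ) gives the 0.2% from the text: 500 diagonal quads out of 250000.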

Direct3D shader compiler (FXC) on PC
Instruction counts displayed by FXC on PC don't mean much nowadays. FXC just translates HLSL to asm, which will later be translated by the driver to a hardware-specific IL. It's quite possible to decrease the instruction count displayed by FXC and slow the shader down at the same time. Instead of relying on FXC numbers it's better to check real performance (FPS/ms) and/or check numbers generated by special tools (ShaderAnalyzer and ShaderPerf). They also display the GPR count, which is quite important as it shows how much stuff can run in parallel.

Hardware instructions
  • MADD is a hardware instruction on the GPU. Convert code like "( x - a ) * b" to "x * c + d", where c = b and d = -a * b. This can save 1 ALU instruction.
  • Saturate, negation and abs are instruction modifiers and are free. Yes, there is a free dinner :). Sometimes equations can be changed to use saturate instead of clamp/min/max. Negation and abs can help to decrease the number of used constant registers.
  • Some instructions are executed on the transcendental units. Transcendental units compute everything as scalars and there is roughly one transcendental unit per 2-8 ALUs. It's a good idea to avoid excessive usage of instructions like sin, cos, log, sqrt, pow (very bad - calculated using three instructions).
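
The MADD rewrite from the first bullet boils down to folding the constants ahead of time. A sketch in C++, assuming a and b are constants that can be combined on the CPU before being uploaded; MadConstants, FoldScaleBias and ApplyMad are names invented for this example.

```cpp
#include <cassert>
#include <cmath>

// Precomputed constants so that ( x - a ) * b == x * c + d.
struct MadConstants { float c, d; };

// Done once on the CPU (or by hand); the shader then only sees c and d.
MadConstants FoldScaleBias(float a, float b)
{
    MadConstants k;
    k.c = b;
    k.d = -a * b;
    return k;
}

// What the shader would evaluate: one multiply-add instead of a subtract and a multiply.
float ApplyMad(float x, MadConstants k)
{
    return x * k.c + k.d;
}
```

The two forms can differ in the last bits due to floating point rounding, which is usually irrelevant for shading.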

Vectorize with care
Some GPUs have vector ALUs (AMD/ATI cards and NVIDIA cards older than G80) and some have scalar ones (NVIDIA G80, G92, Fermi). A vector ALU means that a scalar instruction takes the same time as a vector one, which computes 4 components at once. Usually people try to vectorize everything in shaders, which can add some additional computations and actually result in a slower shader on scalar ALU hardware. It's a good idea to mask vector computations. For example, in a blur shader there is no need to calculate the alpha channel, so just use float3 for accumulation. We could go further and even write two shader versions: one for vector ALUs and one for scalar ones. There is no point in vectorizing instructions computed on transcendental units (sin, cos, log, pow...) - they are always scalar.

Clip/texkill
Consider adding clip/texkill (or alpha test, which can be faster on old hardware) when alpha blending is enabled. Think deferred lights, particles, volumetric light shafts. This can remove some work from the ROP units if you don't have uber tight geometry.

Interpolators
Shader bottlenecks aren't only about ALU, GPR and texture fetches. On rare occasions (or on some hardware) interpolators can become a bottleneck. Sometimes, when using short pixel shaders, it's better to move computations from the vertex shader to the pixel shader if it helps to decrease the interpolator count.
// 8 interpolators and minimal ALU in pixel shader
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv[ i ];
    }
    return color;
}

// one interpolator and some ALU in pixel shader
float4 gSomeValMul[ 8 ];
float4 gSomeValAdd[ 8 ];
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i < 8; ++i )
    {
        color += In.m_uv * gSomeValMul[ i ] + gSomeValAdd[ i ];
    }
    return color;
}
The 8 interpolator version runs at 28.17ms (100 runs on a GeForce 240). The 1 interpolator + some ALU version runs at 21.44ms (same as an empty pixel shader). This is of course a very specific case. Still, it's a good idea to watch out and pack interpolators.

Monday, September 27, 2010

Software occlusion culling


Today's CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from gameplay programmers and use it to draw pretty pictures.

Software occlusion culling using rasterization isn't a new idea (HOM). Basically it means filling a software z-buffer and testing some objects against it (usually screen space bounding boxes). Rasterization is usually done at a small resolution (DICE uses 256x114). Testing can also be done using a hierarchical z-buffer (min depth or min/max depth hierarchy).
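The object test itself can be very small. A sketch of testing a screen space bounding box against the software z-buffer, assuming a convention where a smaller stored value means closer to the camera (the convention depends on what gets stored; flip the comparison otherwise). DepthBuffer and IsRectVisible are names made up for this example.

```cpp
#include <cassert>
#include <vector>

struct DepthBuffer
{
    int width, height;
    std::vector<float> depth; // assumption: smaller value = closer to the camera
};

// Returns true if any pixel of the rect [x0, x1] x [y0, y1] (inclusive, clamped to the
// buffer) could show geometry at rectMinDepth, the nearest depth of the tested object.
bool IsRectVisible(const DepthBuffer& zb, int x0, int y0, int x1, int y1, float rectMinDepth)
{
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 > zb.width - 1)  x1 = zb.width - 1;
    if (y1 > zb.height - 1) y1 = zb.height - 1;
    for (int y = y0; y <= y1; ++y)
    {
        for (int x = x0; x <= x1; ++x)
        {
            if (rectMinDepth <= zb.depth[y * zb.width + x])
            {
                return true; // early out on the first pixel that is not occluded
            }
        }
    }
    return false; // every covered pixel has a closer occluder (or the rect is off screen)
}
```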

How to write one? Step one - the transformation pipeline. It can be a bottleneck if it isn't done properly. Step two - the clipper. Clipper code quality isn't so important. Just remember to clamp coordinates, or clip x and y coordinates after the projection divide. Step three - a scanline or half-space rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on SPU with many threads and interpolants. This time I prototyped software occlusion culling for a "min-spec" PC (1-2 core CPU), so there is only one thread, one interpolant and the resolution is quite small. In this case scanlines were about 2-3 times faster than half-spaces.
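For reference, the core of a half-space rasterizer fits in a few lines. This is a deliberately naive sketch, not the prototype described above: it assumes integer pixel sample positions, a triangle with positive EdgeFunction area, inclusive edges (no top-left fill rule) and no incremental edge stepping. All names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Point { std::int32_t x, y; };

// Twice the signed area of triangle (a, b, p).
std::int32_t EdgeFunction(Point a, Point b, Point p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Marks covered pixels in mask (width * height bytes) and returns how many were covered.
int RasterizeHalfSpace(Point v0, Point v1, Point v2, int width, int height,
                       std::vector<std::uint8_t>& mask)
{
    // Bounding box of the triangle, clamped to the buffer.
    const std::int32_t minX = std::max<std::int32_t>(0, std::min({v0.x, v1.x, v2.x}));
    const std::int32_t minY = std::max<std::int32_t>(0, std::min({v0.y, v1.y, v2.y}));
    const std::int32_t maxX = std::min<std::int32_t>(width - 1, std::max({v0.x, v1.x, v2.x}));
    const std::int32_t maxY = std::min<std::int32_t>(height - 1, std::max({v0.y, v1.y, v2.y}));

    int covered = 0;
    for (std::int32_t y = minY; y <= maxY; ++y)
    {
        for (std::int32_t x = minX; x <= maxX; ++x)
        {
            const Point p = { x, y };
            // A pixel is inside when it lies on the inner side of all three edges.
            const bool inside = EdgeFunction(v1, v2, p) >= 0
                             && EdgeFunction(v2, v0, p) >= 0
                             && EdgeFunction(v0, v1, p) >= 0;
            if (inside)
            {
                mask[y * width + x] = 1;
                ++covered;
            }
        }
    }
    return covered;
}
```

A real implementation would step the three edge values incrementally per pixel and per row, which is what makes the approach map so well to SIMD.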

Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use float for positions). For depth only rendering perspective interpolation is very easy - it's enough to interpolate 1/z' (z' = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z'. There are some differences between a hi res and a small res z-buffer. To fix this, the pixel center should be shifted using dzdx and dzdy. In practice it's enough to add some eps when testing objects.

Some rasterization performance results. Rasterization with full transformation pipeline and clipping. Optimized with some SSE intrinsics. Randomly placed 500 quads (each consists of 2 triangles). No special optimizations for quads and all are fully visible. 256x128 resolution and 1 thread. CPU / quad pixel screen size:

CPU                          256x128   61x61     21x21     fillrate      vertex rate
i7 860 (2.8ghz)              6.56 ms   1.75 ms   0.53 ms   2.50 GPix/s   0.025 GV/s
core2 quad Q8200 (2.33ghz)   9.20 ms   2.30 ms   0.67 ms   1.76 GPix/s   0.019 GV/s

This shows the true power of the i7 - almost 1 pixel filled per cycle :). In a real test case there should be something like 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels), so it looks like 1-2ms is enough for filling the software z-buffer. It could be optimized for big triangles by writing code for quick rejection of empty tiles and code for filling fully covered tiles (just like Larrabee does). This dramatically increases performance for large triangles.

Some object testing performance results. Transformation time not included - should be already done for frustum culling and it's quite small (0.33ms for i7 and 0.48 for core2 quad). Clipping. Optimized with some SSE intrinsics. Randomly placed 3k quads (each fully visible). Worst case - no early out (cleared z-buffer). 256x128 resolution. 1 thread. CPU / quad pixel screen size:

CPU                          120x120   30x30     10x10
i7 860 (2.8ghz)              2.26 ms   0.07 ms   0.02 ms
core2 quad Q8200 (2.33ghz)   3.30 ms   0.09 ms   0.03 ms

This also looks reasonably fast, and in a real test case the numbers should be around 1-2ms. It could be further optimized by using some kind of depth hierarchy (downscaling the z-buffer is very fast - something like 0.05ms for a full mip-map chain).
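One level of such a depth hierarchy could be built like this. The sketch assumes a convention where a larger value means farther away, and keeps the farthest depth of every 2x2 block so that a test against the coarse level stays conservative; it also assumes even dimensions. DownsampleFarthest is a name made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the next coarser level of the depth hierarchy.
// Assumptions: larger value = farther from the camera, width and height are even.
// Each output texel keeps the farthest depth of its 2x2 source block, so an object
// that is behind the coarse value is behind every occluder pixel in that block.
std::vector<float> DownsampleFarthest(const std::vector<float>& src, int width, int height)
{
    const int dstW = width / 2;
    const int dstH = height / 2;
    std::vector<float> dst(static_cast<std::size_t>(dstW) * dstH);
    for (int y = 0; y < dstH; ++y)
    {
        for (int x = 0; x < dstW; ++x)
        {
            const float* row0 = &src[(2 * y) * width + 2 * x];
            const float* row1 = row0 + width;
            dst[y * dstW + x] = std::max(std::max(row0[0], row0[1]),
                                         std::max(row1[0], row1[1]));
        }
    }
    return dst;
}
```

Calling it repeatedly until the result is 1x1 gives the full mip chain; large objects can then be tested against a coarse level first.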

Software occlusion culling is quite cool - you can have skinned occluders :). It's easy to write and easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 with a single thread, software occlusion culling probably won't be faster than beamtrees, but IMHO on consoles it can be faster (no tree data structure traversal), and for sure it's easier to parallelize. Maybe one day I'll try to add it to our engine at work and see how it handles real test cases.