Note on Sampling the GGX Distribution of Visible Normals

Introduction
After writing an AO demo in the last post, I started to write a progressive path tracer, but my progress was very slow due to the social unrest in the past few months (here are some related news about what has happened). In the past weeks, the situation has calmed down a bit, so I continued working on my path tracer and started adding specular lighting. While implementing Eric Heitz's "Sampling the GGX Distribution of Visible Normals" technique, I was confused about why taking a random sample on a disk and then projecting it onto the hemisphere gives exactly the GGX distribution of visible normals (VNDF). I couldn't find a proof in the paper, so in this post I will try to verify that their PDFs are equal. (Originally, I planned to write this post after finishing my path tracer demo. But I worry that the situation here in Hong Kong will get worse again and I won't be able to write, so I decided to write it down first. I hope it won't get too boring with only math equations.)
My work in progress path tracer, using GGX material only

Quick summary of sampling the GGX VNDF technique
For those who are not familiar with the GGX VNDF technique, I will briefly describe it. It is an importance sampling technique that samples a random normal vector from the GGX distribution. That normal vector is then used to generate a reflection vector, usually for the next reflected ray during path tracing.

The traditional importance sampling scheme uses D(N) to sample a normal vector
The VNDF technique importance samples using the visible normals, taking the view direction into account
Given a view vector to a GGX surface with arbitrary roughness, the steps to sample a normal vector are:
  1. Transform the view vector to the GGX hemisphere configuration space (i.e. from arbitrary roughness to the roughness = 1 configuration) using the GGX stretch-invariance property.
  2. Sample a random point on the disk obtained by projecting the hemisphere along the transformed view direction.
  3. Re-project the sampled point onto the hemisphere along the view direction; this is our desired normal vector.
  4. Transform the normal vector back from the hemisphere configuration to the original GGX roughness space.
VNDF sampling technique illustration from Eric Heitz's paper
My confusion mainly comes from steps 2 and 3 in the hemisphere configuration: why does this way of generating the normal vector follow the GGX VNDF exactly?
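To make the four steps concrete, below is a C++ transcription of the sampling routine from the paper (its published listing), with a minimal float3 helper added so the snippet is self-contained; in a real renderer this would live in shader code, so treat it as a reference sketch rather than the exact implementation.

#include <cmath>
#include <algorithm>

struct float3 { float x, y, z; };

static float3 mul(const float3& v, float s) { return { v.x * s, v.y * s, v.z * s }; }
static float3 add(const float3& a, const float3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static float  dot(const float3& a, const float3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float3 cross(const float3& a, const float3& b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
static float3 normalize(const float3& v) { return mul(v, 1.0f / std::sqrt(dot(v, v))); }

// V is the view direction in tangent space (Z = surface normal), u1/u2 are uniform random numbers in [0, 1).
float3 SampleGGXVNDF(float3 V, float alphaX, float alphaY, float u1, float u2)
{
    const float PI = 3.14159265358979f;

    // Step 1: stretch the view vector to the hemisphere (roughness = 1) configuration.
    float3 Vh = normalize({ alphaX * V.x, alphaY * V.y, V.z });

    // Build an orthonormal basis around Vh (handle the Vh == Z degenerate case).
    float lensq = Vh.x * Vh.x + Vh.y * Vh.y;
    float3 T1 = lensq > 0.0f ? mul({ -Vh.y, Vh.x, 0.0f }, 1.0f / std::sqrt(lensq)) : float3{ 1.0f, 0.0f, 0.0f };
    float3 T2 = cross(Vh, T1);

    // Step 2: sample a point on the unit disk, then warp t2 so the disk matches the projected visible area.
    float r = std::sqrt(u1);
    float phi = 2.0f * PI * u2;
    float t1 = r * std::cos(phi);
    float t2 = r * std::sin(phi);
    float s = 0.5f * (1.0f + Vh.z);
    t2 = (1.0f - s) * std::sqrt(1.0f - t1 * t1) + s * t2;

    // Step 3: re-project the disk sample onto the hemisphere along Vh to get the normal.
    float3 Nh = add(add(mul(T1, t1), mul(T2, t2)),
                    mul(Vh, std::sqrt(std::max(0.0f, 1.0f - t1 * t1 - t2 * t2))));

    // Step 4: unstretch back to the original roughness configuration.
    return normalize({ alphaX * Nh.x, alphaY * Nh.y, std::max(0.0f, Nh.z) });
}

Steps 2 and 3 correspond to the t1/t2 sampling and the Nh re-projection above, and that is exactly the part this post tries to verify.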

GGX NDF definition
Before digging deep into the problem, let's start with the definition of the GGX NDF. The paper states that the GGX distribution uses only the upper part of the ellipsoid, and when alpha (the roughness) equals 1, the GGX distribution is a uniform hemisphere. According to the isotropic GGX definition:

D(N) = α² / (π ((Z·N)² (α² − 1) + 1)²)

which, with α = 1, reduces to the uniform value:

D(N) = 1 / π

So its PDF (the NDF weighted by the cosine with the surface normal Z) will be:

p(N) = D(N) (Z·N) = (Z·N) / π

So, sampling a normal vector from the GGX distribution (with α = 1) equals sampling a vector with a cosine-weighted distribution.

GGX VNDF definition
The definition of the VNDF depends on the shadowing function, and we are using the Smith shadowing function (with α = 1):

G1(V) = 2 (V·Z) / ((V·Z) + sqrt(α² + (1 − α²)(V·Z)²)) = 2 (V·Z) / (1 + V·Z)

Therefore the VNDF equals:

Dv(N) = G1(V) (V·N) D(N) / (V·Z) = 2 (V·N) / (π (1 + V·Z))
GGX VNDF specific case
With both the GGX NDF and VNDF definitions, we can start investigating the problem. I decided to start with something simple first, a specific case: the view direction equals the surface normal (i.e. V = Z). Substituting V = Z into the VNDF above:

Dz(N) = 2 (Z·N) / (π (1 + Z·Z)) = (Z·N) / π

After this simplification, in the V = Z case the PDF Dz(N) is also cosine-weighted, which equals the traditional method of sampling the GGX NDF.

Now take a look at the sampling scheme in Eric Heitz's method. The method starts with uniform sampling of a unit disk, which has PDF = 1/π; then the point is projected onto the hemisphere along the view direction, which adds a cosine term to the PDF (i.e. (Z·N)/π) according to Malley's method (the cosine term comes from the Jacobian of the projection). Therefore the VNDF and Eric Heitz's method are the same in this specific case: both have a cosine-weighted PDF.

GGX VNDF general case
To verify that Eric Heitz's sampling scheme equals the PDF of the GGX VNDF for all possible viewing directions, we need to calculate the PDF of his method and take care of how the PDF changes with each transformation. From the paper we have this vertical mapping:
Transformation of randomly sampled point from Eric Heitz's paper
We know the PDF of uniformly sampling a unit disk is 1/π (i.e. P(t1, t2) = 1/π), and we need to calculate P(t1, t2'). From the mapping above:

t2' = (1 − s) · sqrt(1 − t1²) + s · t2,    where s = (1 + V·Z) / 2

so ∂t2'/∂t2 = s, and changing variables gives:

P(t1, t2') = P(t1, t2) · |∂t2/∂t2'| = (1/π) · (1/s) = 2 / (π (1 + V·Z))
The next step of the algorithm is to re-project the disk sample onto the hemisphere along the view direction, which produces our target importance-sampled normal. So by Malley's method again (but this time along the view direction instead of the surface normal), we add a (V·N) Jacobian term to the above PDF P(t1, t2'):

P(N) = (V·N) · P(t1, t2') = 2 (V·N) / (π (1 + V·Z)) = Dv(N)

The resulting PDF equals the GGX VNDF definition exactly. This answers my question of why Eric Heitz's sampling scheme is an exact sampling routine for the GGX VNDF.

Conclusion
This post describes my learning process for the paper "Sampling the GGX Distribution of Visible Normals" and resolves the part that confused me most: why taking a random sample on a disk and then projecting it onto the hemisphere samples exactly the GGX VNDF. If anybody knows a simpler proof that these two PDFs are equal, or if you discover any mistake, please let me know in the comments. Thank you.

References
[1] http://www.jcgt.org/published/0007/04/01/paper.pdf
[2] https://hal.archives-ouvertes.fr/hal-01509746/document
[3] https://agraphicsguy.wordpress.com/2015/11/01/sampling-microfacet-brdf/
[4] https://schuttejoe.github.io/post/ggximportancesamplingpart1/
[5] https://schuttejoe.github.io/post/ggximportancesamplingpart2/
[6] http://www.pbr-book.org/3ed-2018/Monte_Carlo_Integration/2D_Sampling_with_Multidimensional_Transformations.html





DXR AO

Introduction
It has been 2 months since my last post. For the past few months, the situation here in Hong Kong has been very bad. Our basic human rights are deteriorating. Absurd things have happened, such as suspected cooperation between the police and triads, as well as police brutality (including shooting directly at journalists). I really don't know what can be done... Maybe you could spare a few minutes to sign some of these petitions? Although such petitions may not be very useful, at least after signing some of them, the US Congress is now discussing the Hong Kong Human Rights and Democracy Act. I would sincerely appreciate your help. Thank you very much!

Back to today's topic: after setting up my D3D12 rendering framework, I started to learn DirectX Raytracing (DXR). I decided to write an ambient occlusion demo first because it is easier than writing a full path tracer, since I do not need to handle material information or lighting data. The demo can be downloaded from here (it requires a DXR compatible graphics card and driver, with Windows 10 build version 1809 or newer).



Rendering pipeline
The demo first renders a G-buffer with normal and depth data. Then a velocity buffer is generated from the current and previous frame camera transforms, stored in RG16Snorm format. Next, rays are traced with a cosine-weighted distribution from world positions reconstructed from the depth buffer. To avoid ray-geometry self-intersection, the ray origin is shifted slightly towards the camera. After that, a temporal and a spatial filter are applied to smooth out the noisy AO image, and an optional bilateral blur pass can be applied for a final clean-up.



Temporal Filter
With the noisy image generated from the ray tracing pass, we can reuse the previous frame's ray-traced data to smooth out the image. In the demo, the velocity buffer is used to get the pixel location in the previous frame (with an additional depth check between the current frame depth value and the re-projected previous frame depth value). As we are calculating ambient occlusion using Monte Carlo integration:

AO = (1/π) ∫ V(ω) (n·ω) dω ≈ (1/N) Σ V(ω_i)

where n is the surface normal, V(ω) is the visibility along direction ω, and the ω_i are drawn from a cosine-weighted distribution (which cancels the (n·ω)/π factor).
We can split the Monte Carlo integration across multiple frames and store the AO result in an RG16Unorm texture, where the red channel stores the accumulated AO result and the green channel stores the total sample count N (clamped to 255 to avoid overflow). So after a new frame is rendered, we can accumulate the AO Monte Carlo integration with the following equation:

AO_accum = (AO_history · N + AO_new) / (N + 1),    N = min(N + 1, 255)

We also reduce the sample count based on the depth difference between the current and previous frame depth buffer values (e.g. when the camera zooms in/out), to "fade out" the accumulated history faster and reduce ghosting artifacts.
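As a sketch, the accumulation and history fade-out could look like the following, written here as plain C++ for readability (the demo does this in a shader); the depth-based fade factor is an assumption of mine, not the demo's exact formula.

#include <algorithm>

struct AOHistory { float ao; float sampleCount; };  // stored as RG16Unorm: R = accumulated AO, G = N

// newAO:      the 1-sample AO value ray traced this frame
// history:    value re-projected from the previous-frame texture
// deltaDepth: |current depth - re-projected previous depth|, used to fade the history
AOHistory AccumulateAO(float newAO, AOHistory history, float deltaDepth)
{
    // Fade the accumulated history when the depth changed a lot (e.g. camera zooms in/out)
    // to reduce ghosting; the scale factor 16 is an arbitrary placeholder, not the demo's value.
    float keep = std::clamp(1.0f - deltaDepth * 16.0f, 0.0f, 1.0f);
    float N = history.sampleCount * keep;

    // Split the Monte Carlo integration across frames: running average of all samples so far.
    float accumulated = (history.ao * N + newAO) / (N + 1.0f);

    // Clamp the sample count so the green channel never overflows.
    return { accumulated, std::min(N + 1.0f, 255.0f) };
}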

AO image traced at 1 ray per pixel
AO image with accumulated samples over multiple frames

But this re-projection temporal filter has a shortcoming: it fails very often at geometry edges (especially when done at half resolution). So in the demo, when the re-projection fails, the filter shifts by 1 pixel and performs the re-projection again to accumulate more samples.

Many edge pixels failed the re-projection test
With 1 pixel shifted, many edge pixels can be re-projected

Since this result is biased, I also reduce the accumulated sample count by a factor of 0.75 so that the correct ray-traced result "blends in" faster.

Spatial Filter
To increase the sample count for the Monte Carlo integration, we can also reuse the ray-traced data from neighboring pixels. We search a 5x5 grid and reuse a neighbor's data if it lies on the same surface, determined by comparing delta depth values (i.e. ddx and ddy generated from the depth buffer). As the delta depth values are re-generated from the depth buffer, some artifacts may be seen at triangle edges.

noisy AO image applied with a spatial filter
artifact shown at the triangle edge by re-constructed delta depth
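A rough sketch of the neighbor reuse described above, written as plain C++ over flat buffers; all names, the weighting scheme and the depth tolerance are illustrative assumptions rather than the demo's exact implementation.

#include <cmath>
#include <vector>

struct AOSample { float ao; float sampleCount; };

// Reuse ray-traced data from a 5x5 neighborhood; a neighbor is accepted only if its depth is
// close to the depth predicted from the center pixel's depth derivatives (same-surface test).
// The demo additionally bypasses this test when the sample count is still low (see the
// interleaved rendering discussion below).
AOSample FilterPixel(int x, int y, int width, int height,
                     const std::vector<AOSample>& ao,
                     const std::vector<float>& depth,
                     const std::vector<float>& depthDdx,
                     const std::vector<float>& depthDdy,
                     float depthTolerance)
{
    int center = y * width + x;
    float aoSum = 0.0f, countSum = 0.0f, accepted = 0.0f;

    for (int dy = -2; dy <= 2; ++dy)
    {
        for (int dx = -2; dx <= 2; ++dx)
        {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                continue;
            int n = ny * width + nx;

            // Predict the neighbor's depth from the center pixel's ddx/ddy of depth.
            float expectedDepth = depth[center] + dx * depthDdx[center] + dy * depthDdy[center];
            if (std::fabs(depth[n] - expectedDepth) > depthTolerance)
                continue;  // probably a different surface, skip it

            // Weight each neighbor by how many samples it has already accumulated.
            aoSum    += ao[n].ao * ao[n].sampleCount;
            countSum += ao[n].sampleCount;
            accepted += 1.0f;
        }
    }

    if (countSum <= 0.0f)
        return ao[center];
    return { aoSum / countSum, countSum / accepted };
}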
To save some performance, besides using half resolution rendering, we can also interleave the ray casts so that only 1 in every 4 pixels traces a ray each frame, and the remaining pixels are ray cast over the next few frames.

Rays are traced only at the red pixels
to save performance
For those pixels without any ray-traced data during interleaved rendering, we use the spatial filter to fill in the missing data. The same-surface depth check in the spatial filter can be bypassed when the sample count (stored in the green channel by the temporal filter) is low, because it is better to have some "wrong" neighbor data than no data at all for the pixel. This also helps remove the edge artifact shown before.

Rays are traced at interleaved pattern,
leaving many 'holes' in the image
Spatial filter will fill in those 'holes'
during interleaved rendering

Also, when ray casts are interleaved between pixels, we need to pay attention to the temporal filter too: there is a chance that we re-project to a previous frame pixel that has no sample data. In this case, we snap the re-projected UV to the nearest pixel that did cast an interleaved ray in the previous frame.

Bilateral Blur
To clean up the remaining noise from the temporal and spatial filters, a bilateral blur is applied. We can get a wider blur by using the edge-aware À-Trous algorithm. The blur radius is adjusted according to the sample count (stored in the green channel by the temporal filter), so when we have already cast many ray samples, we can reduce the blur radius to get a sharper image.

Applying an additional bilateral blur to smooth out remaining noise

Random Ray Direction
When choosing the random ray cast directions, we want the chosen directions to have a more significant effect. Since we have a spatial filter that reuses neighboring pixels' data, we can try to cast rays such that the angles between the ray directions of neighboring pixels are as large as possible, while also covering as much of the hemisphere as possible.



It looks like we can use some kind of blue noise texture so that the ray directions are well distributed. Let's take a look at how a cosine-weighted random ray direction is generated from two uniform random numbers ξ1, ξ2:

r = sqrt(ξ1),    ϕ = 2π ξ2
x = r cos(ϕ),    y = r sin(ϕ),    z = sqrt(1 − ξ1)

where (x, y, z) is the sampled direction in tangent space, with z along the surface normal.
From the above equations, the random variable ϕ directly corresponds to the random ray direction on the tangent plane, and there is a linear relationship between the angle ϕ and the random variable ξ2. Since we generate random numbers using a wang hash, which is white noise, maybe we can stratify the random range and use blue noise to pick the desired stratum, turning the result into something like a blue noise pattern. For example, originally we have a random number in [0, 1); we can stratify it into 4 ranges: [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1). Then, using the screen space pixel coordinates, we sample a tileable blue noise texture, and according to the value of the blue noise, we scale the white noise random number into 1 of the 4 stratified ranges. Below is some sample code of how the stratification is done:
int BLUE_NOISE_TEX_SIZE = 64;
int STRATIFIED_SIZE = 16;
// fetch a blue noise value for this pixel from a tileable 64x64 blue noise texture
float4 noise = blueNoiseTex[pxPos % BLUE_NOISE_TEX_SIZE];
// quantize the 8-bit blue noise into one of the STRATIFIED_SIZE strata (0 .. 15)
uint2 noise_quantized = noise.xy * (255.0 * STRATIFIED_SIZE / 256.0);
float2 r = wang_hash(pxPos);  // random white noise in range [0, 1)
// scale the white noise into the stratum selected by the blue noise: r/16 + stratum/16
r = mad(r, 1.0 / STRATIFIED_SIZE, noise_quantized * (1.0 / STRATIFIED_SIZE));
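For reference, the stratified random pair can then be mapped to a cosine-weighted direction in the usual way (sample a disk, project up); below is a plain C++ sketch, assuming tangent space with z along the surface normal (the demo does the equivalent in shader code).

#include <cmath>

struct float3 { float x, y, z; };

// r1, r2 are the stratified random numbers from the snippet above, both in [0, 1).
// Returns a cosine-weighted direction in tangent space, with z along the surface normal.
float3 CosineSampleHemisphere(float r1, float r2)
{
    const float PI = 3.14159265358979f;
    float radius = std::sqrt(r1);       // sample a disk: radius = sqrt(xi1)
    float phi    = 2.0f * PI * r2;      // phi is linear in xi2, which is why stratifying xi2 spreads the directions
    return { radius * std::cos(phi),
             radius * std::sin(phi),
             std::sqrt(1.0f - r1) };    // project the disk sample up onto the hemisphere (Malley's method)
}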
With the blue noise adjusted ray direction, the ray traced AO image looks less noisy visually:

Rays are traced using white noise
Rays are traced using blue noise
Blurred white noise AO image
Blurred blue noise AO image

Ray Binning
In the demo, ray binning is also implemented, but the performance improvement is not significant. Ray binning only shows a large performance gain when the ray tracing distance is large (e.g. > 10m) and both half resolution and interleaved rendering are turned off. I have only run the demo on my GTX 1060; maybe the situation will be different on RTX graphics cards (this is something I need to investigate in the future). Also, the demo may show a slight difference when toggling ray binning on/off, due to the precision difference of storing ray directions in RGBA16Float format (the difference vanishes after accumulating more samples over multiple frames with the temporal filter).

Conclusion
In this post, I have described how DXR is used to compute ray-traced AO in real time using a combination of temporal and spatial filters. Those filters are important for increasing the total sample count of the Monte Carlo integration and getting a noise-free and stable image. The demo can be downloaded from here. There is still plenty of room for improvement, such as a better filter: currently, when the AO distance is large and both half resolution and interleaved rendering are turned on (i.e. 1 ray per 16 pixels), the image is too noisy and not temporally stable during camera movement. Maybe I will improve this when writing a path tracer in the future.

References
[1] DirectX Raytracing (DXR) Functional Spec https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
[2] Edge-Avoiding À-Trous Wavelet Transform for Fast Global Illumination Filtering https://jo.dreggn.org/home/2010_atrous.pdf
[3] Free blue noise textures http://momentsingraphics.de/BlueNoise.html
[4] Quick And Easy GPU Random Numbers In D3D11 http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/
[5] Leveraging Real-Time Ray Tracing to build a Hybrid Game Engine http://advances.realtimerendering.com/s2019/Benyoub-DXR%20Ray%20tracing-%20SIGGRAPH2019-final.pdf
[6] "It Just Works": Ray-Traced Reflections in 'Battlefield V' https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91023-it-just-works-ray-traced-reflections-in-battlefield-v.pdf

Reflection and Serialization

Introduction
Reflection and serialization is a convenient way to save/load data. After reading "The Future of Scene Description on 'God of War'", I decided to try to write something like the "Compile-time Type Information" described in the presentation (but a much simpler one with fewer features). All I need is something to save/load C style structs (something like the D3D DESC structures, e.g. D3D12_SHADER_RESOURCE_VIEW_DESC) in my toy engine.

Reflection
A reflection system is needed to describe how structs are defined before writing a serialization system. This site has plenty of information about the topic. I use a similar approach, describing the C struct data with some macros. We define the following 2 data types to describe every struct that needs to be reflected/serialized in my toy engine (with some variables omitted for easier understanding).
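The original code figure is not reproduced here, so below is a minimal C++ sketch of what these two records could look like; the exact field names are my assumptions (the post notes that some variables are omitted).

#include <cstdint>

struct TypeInfo;  // forward declaration, TypeInfoMember points back to a TypeInfo

struct TypeInfoMember
{
    const char*     name;      // member variable name, e.g. "x"
    const TypeInfo* typeInfo;  // reflected type of this member
    uint32_t        offset;    // byte offset from the start of the struct (offsetof)
    uint32_t        size;      // sizeof the member
};

struct TypeInfo
{
    const char*           name;         // struct name, e.g. "vec3"
    uint32_t              size;         // sizeof the struct
    uint32_t              alignment;    // alignof the struct
    const TypeInfoMember* members;      // array describing each reflected member
    uint32_t              memberCount;
};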



As you can guess from their names, TypeInfo is used to describe a C struct that needs to be reflected/serialized, and TypeInfoMember is responsible for describing the member variables inside the struct. We can use some macro tricks to reflect a struct (more can be found in the references):

struct reflection example
The above example reflects 3 variables inside struct vec3: x, y, z. The trick of those macros is to use sizeof(), alignof(), offsetof() and the using keyword. A sample implementation can be found below:
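Since the original listing is an image, here is a hedged sketch of how such macros can be written with the sizeof()/alignof()/offsetof()/using tricks, reusing the TypeInfo/TypeInfoMember structs sketched earlier; the macro names and the nullptr member-type field are my own simplifications, not the engine's actual code.

#include <cstddef>   // offsetof
#include <cstdint>

// (requires the TypeInfo / TypeInfoMember definitions from the sketch above)

// In this simplified form, reflect one struct per translation unit (the ReflectedType alias
// would otherwise conflict); the real implementation handles registration and nested types.
#define REFLECT_STRUCT_BEGIN(Type)                                        \
    using ReflectedType = Type;                                           \
    static const TypeInfoMember s_members_##Type[] = {

#define REFLECT_STRUCT_MEMBER(member)                                     \
        { #member, nullptr,                                               \
          (uint32_t)offsetof(ReflectedType, member),                      \
          (uint32_t)sizeof(ReflectedType::member) },

#define REFLECT_STRUCT_END(Type)                                          \
    };                                                                    \
    static const TypeInfo s_typeInfo_##Type = {                           \
        #Type, (uint32_t)sizeof(Type), (uint32_t)alignof(Type),           \
        s_members_##Type,                                                 \
        (uint32_t)(sizeof(s_members_##Type) / sizeof(s_members_##Type[0])) };

struct vec3 { float x, y, z; };

// Reflecting vec3: x, y and z each become one TypeInfoMember entry.
REFLECT_STRUCT_BEGIN(vec3)
    REFLECT_STRUCT_MEMBER(x)
    REFLECT_STRUCT_MEMBER(y)
    REFLECT_STRUCT_MEMBER(z)
REFLECT_STRUCT_END(vec3)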



This approach has one disadvantage: we cannot use bit fields to specify how many bits are used by a variable, and bit field order seems to be compiler dependent. So I simply don't use bit fields in structs that need to be reflected.

It also has another disadvantage: reflecting each variable manually is error-prone. So I have written a C struct header parser (using Flex & Bison) to generate the reflection source code. For C struct files that need auto-generated reflection data, instead of naming the source file with the extension .h, we name it with another file extension (e.g. .hds) and use a custom MSBuild file in Visual Studio to execute my header parser. To make Visual Studio apply syntax highlighting to this custom file type, we need to associate the file extension with C/C++ syntax by navigating to
"Tools" -> "Options" -> "Text Editor" -> "File Extension"
and adding the appropriate association:


But one thing I cannot figure out is auto-complete when typing "#include" for a custom file extension; it looks like Visual Studio only filters for a couple of extensions (e.g. .h, .inl, ...) and cannot recognize my new file type... If someone knows how to do it, please leave a comment below. Thank you.
MSVC auto-complete only filters for .h files and cannot discover the new .hds type

Serialization
With the reflection data available, we know how large a struct is, how many variables it has and their byte offsets from the start of the struct, so we can serialize our C struct data. We define the serialization format with a data header followed by a number of data chunks, as in the following figure:

Memory layout of a serialized struct

Data Header
The data header contains all the TypeInfo used by the serialized struct, as well as the architecture information (i.e. x86 or x64). During de-serialization, we can compare the runtime TypeInfo against the serialized TypeInfo to check whether the struct has any layout/type change (to speed up the comparison, we generate a hash value for every TypeInfo from the content of the file that defines the struct). If a layout/type change is detected, we de-serialize the struct variables one by one (and perform data conversion if necessary, e.g. int to float); otherwise, we de-serialize the data in whole chunks.
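As a purely hypothetical illustration of this layout (field names and packing are my guesses, not the engine's actual on-disk format), the header and chunk records might look like:

#include <cstdint>

enum class ChunkType : uint8_t
{
    RawBytes, SizeT, String, Struct, PointerSimple, PointerComplex  // the 6 chunk types described below
};

struct SerializedTypeInfoEntry
{
    uint64_t hash;      // hash of the header file content that defines the struct
    uint32_t size;      // sizeof(struct) at serialization time
    uint32_t nameLen;   // the struct name and its member descriptions follow this entry
};

struct DataHeader
{
    uint8_t  pointerSize;    // 4 on x86, 8 on x64, so architecture changes can be detected
    uint32_t typeInfoCount;  // number of SerializedTypeInfoEntry records that follow
};

struct ChunkHeader
{
    ChunkType type;     // which kind of payload follows
    uint64_t  byteSize; // payload size of this chunk
};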

Data Chunk
The values of the C struct are stored in data chunks. There are 6 types of data chunks: RawBytes, size_t, String, Struct, PointerSimple, PointerComplex. There are 2 reasons to divide the chunks into different types: first, we want the serialized data to be usable across architectures (e.g. serialized on x86, de-serialized on x64), where some data types have different sizes depending on the architecture (e.g. size_t, pointers); second, we want to support serializing pointers (with some restrictions). Below is a simple C struct that illustrates how the data is divided into chunks:

This Sample struct gets serialized into 3 data chunks
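The original figure is not reproduced here; the post only names val_int and val_float, so the struct below is a hypothetical example of one that would produce three chunks under the rules described in the following sections (the other members are illustrative).

#include <cstddef>

struct Sample
{
    int         val_int;    // |
    float       val_float;  // | -> one RawBytes chunk (architecture independent sizes)
    size_t      val_size;   //   -> one size_t chunk (serialized as a 64-bit integer)
    char*       val_name;   //   -> one String chunk (length found via '\0')
};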

RawBytes chunk
A RawBytes chunk contains a group of values whose sizes are architecture independent. Referring to the Sample struct above, the variables val_int and val_float are grouped into a single RawBytes chunk, so that at run time those values can be de-serialized with a single call to memcpy().

size_t chunk
A size_t chunk contains a single size_t value, which gets serialized as a 64-bit integer to avoid data loss. Loading a too-large value on the x86 architecture will cause a warning. Usually this type will not be used; I just added it in case I need to serialize this type for a third party library.

String chunk
A String chunk is used for storing the string value of a char*; the serializer can determine the length of the string (by looking for '\0') and serialize it appropriately.

Struct chunk
A Struct chunk is used when we serialize a struct that contains another struct which has some architecture dependent variables. With this chunk type, we can serialize/de-serialize recursively.
The ComplexSample struct contains a Complex struct that has some architecture dependent values,
which cannot be collapsed into a RawBytes chunk, so it gets serialized as a Struct chunk instead.

PointerSimple chunk
A PointerSimple chunk stores a pointer variable where the size of the data referenced by the pointer does not depend on the architecture, so it can be de-serialized with a single memcpy(), similar to the RawBytes chunk. To determine the length of a pointer (sometimes a pointer is used like an array), my C struct header parser recognizes some special macros that define the length of the pointer (these macros expand to nothing when parsed by the normal Visual Studio C/C++ compiler). Usually the length of the pointer depends on another variable within the same struct, so with the special macro, we can define the length of the pointer like below:

The DESC_ARRAY_SIZE() macro tells the serializer that
the size depends on the variable num within the same struct
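The figure is not reproduced here; a hypothetical version of such a struct could look like the following, where DESC_ARRAY_SIZE() expands to nothing for the regular compiler and is only read by the header parser (the macro spelling, its placement and the struct name are illustrative, not the engine's exact syntax).

#include <cstdint>

#define DESC_ARRAY_SIZE(count_member)   // expands to nothing when compiled normally

struct BufferDesc
{
    uint32_t num;                           // number of elements pointed to by 'data'
    float*   data DESC_ARRAY_SIZE(num);     // serializer reads 'num' to know how many bytes to write
};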

When serializing the above struct, the serializer will look up the value of the variable num to determine the length of the pointer variable data, so that we know how many bytes need to be serialized for data.

But this macro alone is not enough to cover all my use cases; for example, when serializing D3D12_SUBRESOURCE_DATA for a 3D texture, the length of the pData variable cannot simply be calculated from RowPitch and SlicePitch:

A sample struct to serialize a 3D texture, where the length of
D3D12_SUBRESOURCE_DATA::pData depends on the depth of the resource

The length can only be determined with access to the struct Texture3DDesc, which has the depth information. To tackle this, my serializer can register custom pointer-length calculation callbacks (e.g. registered for the D3D12_SUBRESOURCE_DATA::pData variable inside the Texture3DDesc struct). The serializer keeps track of a stack of the struct types currently being serialized, so that the callback can be triggered appropriately.

Finally, if a pointer variable has neither a length macro nor a registered length calculation callback, we assume the pointer has a length of 1 (or 0 if it is nullptr).

PointerComplex chunk
A PointerComplex chunk also stores a pointer variable, but the referenced data is architecture dependent, similar to the Struct chunk type. It uses the same pointer-length calculation method as the PointerSimple chunk type.

Serialize union
We can also serialize structs with union values that depend on another integer/enum variable, similar to D3D12_SHADER_RESOURCE_VIEW_DESC. We use the same macro approach as for the pointer length calculation. For example:
A sample to serialize variables inside union
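The original code figure is not reproduced here; a hypothetical reconstruction based on the description below could look like this, where DESC_UNION() expands to nothing for the regular compiler and only carries information for the header parser (the macro placement and struct name are my assumptions).

#define DESC_UNION(selector, value)   // expands to nothing when compiled normally

enum class ValType { Integer, Double };

struct UnionSample
{
    ValType type;                                               // selects which union member is active
    union
    {
        int    val_integer DESC_UNION(type, ValType::Integer);  // serialized when type == Integer
        double val_double  DESC_UNION(type, ValType::Double);   // serialized when type == Double
    };
};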
In the above example, the DESC_UNION() macro adds information about when the variable needs to be serialized. During serialization, we check the value of the variable type: if type == ValType::Double, we serialize val_double; else if type == ValType::Integer, we serialize val_integer.

Conclusion
This post has described how a simple reflection system for C structs is implemented, using a macro-based approach assisted by a code generator. Based on the reflection data, we can implement a serialization system to save/load C structs using compile-time type information. This system is simple and does not support complicated features like C++ class inheritance. It is mainly for serializing C style structs, which is enough for my current needs.

References
[1] https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/
[2] https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description
[3] https://blog.molecular-matters.com/2015/12/11/getting-the-type-of-a-template-argument-as-string-without-rtti/