<p><span style="font-size: large;"><b>Studying "Spectral Primary Decomposition"</b></span></p><p><span style="font-size: large;"><b>Introduction</b></span></p><p>It has been a long time since my last blog post (because of Covid, work, Elden Ring...). So I decided to study <a href="https://graphics.geometrian.com/research/spectral-primaries.html">"Spectral Primary Decomposition for rendering with sRGB Reflectance"</a>, which was used in previous posts, to refresh my memory. It is an efficient technique to up-sample sRGB textures to spectral reflectance by multiplying the sRGB values with 3 precomputed basis functions (a short sketch of this reconstruction is given after the basis-function figures below):</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgdqIQu-wTpHgRF00MYXJLGRJnS-GJ6PLAzz9M_oMQRC4pxUqAy4BjM53xs26wrD60Iz6_TSBnA7OC3q_FkTWBGN2ZxAAKx05EEL4L2A5GsDVy8A7h8JzqwX5-hb7dl3A4aKKYKBqQBjIvidG-zGIJwrlCZ8ZidW0Qic1qbvC1YCRAppo4ROrXDb94Lyg" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="655" data-original-width="2399" height="175" src="https://blogger.googleusercontent.com/img/a/AVvXsEgdqIQu-wTpHgRF00MYXJLGRJnS-GJ6PLAzz9M_oMQRC4pxUqAy4BjM53xs26wrD60Iz6_TSBnA7OC3q_FkTWBGN2ZxAAKx05EEL4L2A5GsDVy8A7h8JzqwX5-hb7dl3A4aKKYKBqQBjIvidG-zGIJwrlCZ8ZidW0Qic1qbvC1YCRAppo4ROrXDb94Lyg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Overview of "Spectral Primary Decomposition" from the <a href="https://geometrian.com/data/research/spectral-primaries/SpectralPrimaryDecompositionPoster.pdf"><i>Explanatory Poster</i></a><br /></td></tr></tbody></table>In this post, I would like to find an efficient spectral up-sampling method that also supports wider gamuts (e.g. Display-P3), or to investigate why this technique does not support them.<p></p><p><br /></p><p><span style="font-size: large;"><b>Porting to Octave</b></span></p><p>The paper provides sample source code written in Matlab. Since I do not have a Matlab license, the first thing I needed to do was port the source code to the open source <a href="https://octave.org/">Octave</a> (the ported source code can be found <a href="https://drive.google.com/file/d/16FwiLBZtGC8Mq2-P5tkFHVUwUuSL9jzE/view?usp=share_link">here</a>). 
During the porting process, the <a href="https://www.mathworks.com/help/optim/ug/fmincon.html"><b>fmincon()</b></a> function used for finding the 3 spectral primary basis functions does not work in Octave, so I switched to <a href="https://octave.sourceforge.io/octave/function/sqp.html"><b>sqp()</b></a> instead (and also removed the <a href="https://www.mathworks.com/help/optim/ug/linprog.html"><b>linprog()</b></a> call from the original source code).</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvcxulgukqaJy9I13XwE_fwWRDgvMV3pha351_iB7xlbhuwA9BHyN2YsW0H3KoKmE4mufhnVlBw8JOwjnI3LyadXPeu6dIgVnaSP6P8PSutsYrAooCbF_mDX8jhyfQH6rSvQSsWO7h5ffJ_FdiEuvf7GAuDT6WTkuI11CF07srnAmxA8v1c-Wg0XkZug/s2140/graph_linear.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1196" data-original-width="2140" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvcxulgukqaJy9I13XwE_fwWRDgvMV3pha351_iB7xlbhuwA9BHyN2YsW0H3KoKmE4mufhnVlBw8JOwjnI3LyadXPeu6dIgVnaSP6P8PSutsYrAooCbF_mDX8jhyfQH6rSvQSsWO7h5ffJ_FdiEuvf7GAuDT6WTkuI11CF07srnAmxA8v1c-Wg0XkZug/s2140/graph_linear.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Basis Functions generated in Octave<br /></td></tr></tbody></table><p></p><p>The resulting graph is not as smooth as in the original paper, so I decided to try a different initial value for the objective function. I chose a normalized Color Matching Function (CMF):</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjoVLRu0QKnxRdpvGSlNEVbzgWLZ3_2M0VbSMULIdlGSvocC3ZO0zc5EZfqrVZBOoX6C6VKtLUkoNsardOs2Gk3d2H2FpYy5gdbXSbIBbBkzwKK_dLvtuJvW0umzWD_XD7SqowhUXncNjdtfvLog3D6lTIUCeu6molQ_gx62HrldDRLVVkUhykZDGdPrw" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1196" data-original-width="2140" height="224" src="https://blogger.googleusercontent.com/img/a/AVvXsEjoVLRu0QKnxRdpvGSlNEVbzgWLZ3_2M0VbSMULIdlGSvocC3ZO0zc5EZfqrVZBOoX6C6VKtLUkoNsardOs2Gk3d2H2FpYy5gdbXSbIBbBkzwKK_dLvtuJvW0umzWD_XD7SqowhUXncNjdtfvLog3D6lTIUCeu6molQ_gx62HrldDRLVVkUhykZDGdPrw" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Basis Functions generated with normalized CMF initial value<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh6guL7uvKHKFq-uPxDE9oPYf7qKhH5Awfy8x3R_JHbQ46BuCol7nx5oeFTB-QqNN3huUynOgnMaRX7PQz7N9j82gJaRWaZNd03gv8BkM7fYG4ERffPNhsz_r7ntXBuFI66xVsFeIyid26MH1SFuEEN4LkzcgUSSnYW4-TGEX8PPPUVrUhGmwvFiT86UQ" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="324" data-original-width="503" height="206" src="https://blogger.googleusercontent.com/img/a/AVvXsEh6guL7uvKHKFq-uPxDE9oPYf7qKhH5Awfy8x3R_JHbQ46BuCol7nx5oeFTB-QqNN3huUynOgnMaRX7PQz7N9j82gJaRWaZNd03gv8BkM7fYG4ERffPNhsz_r7ntXBuFI66xVsFeIyid26MH1SFuEEN4LkzcgUSSnYW4-TGEX8PPPUVrUhGmwvFiT86UQ=w320-h206" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Code for generating normalized CMF initial value<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
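<p>For clarity, the up-sampling step itself is just a linear combination of the three precomputed basis functions. Below is a minimal NumPy sketch of this reconstruction; the sampling range and the zero-filled basis arrays are placeholders (the real values come from the optimization above), so treat it as an illustration rather than the paper's code:</p>
<pre>
import numpy as np

# Placeholder sampling: 36 samples over [380, 730] nm at 10 nm steps.
wavelengths = np.arange(380, 740, 10)
basis = np.zeros((36, 3))   # columns: R, G, B basis functions (placeholder)

def upsample_srgb(rgb_linear):
    """Reconstruct a spectral reflectance as a linear combination of
    the three precomputed basis functions."""
    # With non-negative bases that sum to 1 at every wavelength (the
    # paper's constraint), rgb values in [0, 1] yield a physically
    # plausible reflectance in [0, 1].
    return basis @ np.asarray(rgb_linear, dtype=float)
</pre>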
<p></p><p>The resulting curves look smoother with the normalized CMF as the initial value. Also, during the porting process, I switched to the CIE 2006 2° observer CMF instead of the CIE 1931 / CIE 2006 10° observer CMFs used in the original source code.</p><p><br /></p><p><span style="font-size: large;"><b>Working with wider gamut</b></span></p><p>The next step is to change the color primaries from sRGB to Display-P3 (which the original source listed as infeasible). As expected, the result is not good: not only can saturated colors not be up-sampled, but colors within the sRGB gamut are also not similar to the original colors, and saturated red gains an orange tint after up-sampling. (Note that the images below have a Display-P3 color profile attached; a wide gamut monitor is needed to view the saturated colors outside the sRGB gamut.)<br /></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_sRGB.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="371" data-original-width="800" height="185" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_sRGB.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled saturated sRGB color<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_P3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="373" data-original-width="800" height="186" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_P3.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled saturated P3 color</td></tr></tbody></table>
</td>
</tr>
</tbody></table><p>So I tried modifying the objective function <i>opt_fn()</i> used in <a href="https://octave.sourceforge.io/octave/function/sqp.html"><b>sqp()</b></a> to include a weight that minimizes the color difference of the sRGB primaries (a sketch of this weighted objective follows the result figures below):</p><p></p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjK7cxcz-nwo5ezHfyXMIU7mfdU2qCSVOcc5-_FlFwM5P7Tju8qVW7pC8H8uIHLARJC9qouYND8-DpsQw-YeGhyCv5DysvtqblZq3gJg-jGygNg2VjAAu3894x32MA2JAzQsf9rhFDWAMQo7VOd733p_j6nqT2bQ_rc2duqEWX9bs0YgE3E7rBIqYj-WQ" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1088" data-original-width="1484" height="470" src="https://blogger.googleusercontent.com/img/a/AVvXsEjK7cxcz-nwo5ezHfyXMIU7mfdU2qCSVOcc5-_FlFwM5P7Tju8qVW7pC8H8uIHLARJC9qouYND8-DpsQw-YeGhyCv5DysvtqblZq3gJg-jGygNg2VjAAu3894x32MA2JAzQsf9rhFDWAMQo7VOd733p_j6nqT2bQ_rc2duqEWX9bs0YgE3E7rBIqYj-WQ" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Code snippet of the objective function with sRGB primaries weight<br /></td></tr></tbody></table><p></p><p>The result improves a bit and the up-sampled saturated red has less of an orange tint:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_opt_sRGB.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="368" data-original-width="800" height="184" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_opt_sRGB.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled saturated sRGB color with sRGB primaries weight<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_opt_P3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="365" data-original-width="800" height="183" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_0_1_opt_P3.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled saturated P3 color with sRGB primaries weight</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
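<p>For reference, the shape of the modified objective is roughly the following. This is a Python-flavored sketch of the idea in the screenshot above, not the actual Octave code: the placeholder arrays, the <i>srgb_in_p3</i> conversion and the weight value are all illustrative assumptions:</p>
<pre>
import numpy as np

N = 36
cmf = np.ones((N, 3))        # placeholder: CMF samples (x, y, z columns)
illum = np.ones(N)           # placeholder: D65 SPD samples
rgb_from_xyz = np.eye(3)     # placeholder: XYZ to working-space (P3) RGB
srgb_in_p3 = np.eye(3)       # placeholder: sRGB primaries expressed in P3
W_SRGB = 10.0                # illustrative weight, not the real value

def round_trip(rgb, B):
    """Up-sample rgb with basis matrix B (N x 3), light it with the
    illuminant, integrate against the CMF, and convert back to RGB."""
    xyz = cmf.T @ ((B @ rgb) * illum)
    xyz = xyz / (cmf[:, 1] @ illum)   # normalize: perfect white has Y = 1
    return rgb_from_xyz @ xyz

def opt_fn(B_flat):
    B = B_flat.reshape(N, 3)
    # Base term: round-trip error of the working-space primaries.
    err = sum(np.sum((round_trip(c, B) - c) ** 2) for c in np.eye(3))
    # Extra term: heavily penalize round-trip error of the sRGB primaries.
    err += W_SRGB * sum(np.sum((round_trip(p, B) - p) ** 2)
                        for p in srgb_in_p3)
    return err
</pre>
<p>In Octave this is the function handed to <b>sqp()</b>; a SciPy analogue would be <i>scipy.optimize.minimize(opt_fn, x0, method='SLSQP')</i> with the [0, 1] bound constraints on the basis samples.</p>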
<p></p><p>Up to this point, all the precomputed spectral primary basis functions are within the [0, 1] range (i.e. each basis function does not reflect more light than it receives). I was wondering what would happen if we relaxed this constraint and enforced the limit only after linearly combining all the basis functions. I tried relaxing the range of the individual basis functions to [-0.05, 1.05], [-0.075, 1.075] and [-0.1, 1.1] (details can be found in the visualization website from the <a href="https://drive.google.com/file/d/16FwiLBZtGC8Mq2-P5tkFHVUwUuSL9jzE/view?usp=share_link">modified source code</a>). With the relaxed range, we can get a very similar sRGB color after up-sampling:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_n01_p11_sRGB.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="365" data-original-width="800" height="183" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_n01_p11_sRGB.png" width="400" /></a></div>However, saturated Display-P3 colors still cannot be up-sampled exactly; we can only achieve slightly more saturated colors compared to sRGB:<p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_n01_p11_P3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="349" data-original-width="800" height="175" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/upsample_n01_p11_P3.png" width="400" /></a></div><p></p><p>The up-sampled saturated red shows a visible difference from the original color before up-sampling. I tried modifying the objective function to optimize only the red basis function (ignoring the green and blue basis functions), and still could not get an exact up-sampled saturated red under a D65 light source. Maybe it is impossible to produce the most saturated Display-P3 red under a D65 light source without violating the physical constraint.<br /></p><p>Out of curiosity, I plotted the chromaticity diagram of the up-sampled colors (a sketch of this computation follows the diagrams below). The result shows that, using the limited [0, 1] range, the up-sampling process can produce "more colors" (though not accurately; e.g. red is up-sampled to "orange-red"), while using the relaxed constraint reduces the up-sampled color gamut.</p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjkQfw1SOgdnpuDad01ev1bHGEBfYBp2fw3qIOxjZF9UoiUCWFynCCrjWhdIAiCFrMIt-RFt2mtXYRO6qKq4oBZ28eWKuMhkwp9RxOA5DtPTDVsrKAgOoMpwJ5G_BA7sCmWt24WlJzgEgDBU4_fdNW_SS515Mn_FlHVZufA4M9vmZLYwySSogYBmLBkww" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1042" data-original-width="1042" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEjkQfw1SOgdnpuDad01ev1bHGEBfYBp2fw3qIOxjZF9UoiUCWFynCCrjWhdIAiCFrMIt-RFt2mtXYRO6qKq4oBZ28eWKuMhkwp9RxOA5DtPTDVsrKAgOoMpwJ5G_BA7sCmWt24WlJzgEgDBU4_fdNW_SS515Mn_FlHVZufA4M9vmZLYwySSogYBmLBkww" width="240" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chromaticity diagram of up-sampled color using limited [0, 1] range<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg-DuWVpSuRJiyFZfAAMoS8dXqJZpuWmK7u1ayApTaMXg_BRjuM91sG6u3H95PTj4PnCtwD18T886PWuyMB5gIEO8qqQK3is1d6VsD9CuTb91-78jUD_egS3h_cyAjO5pNHiqb2EjVjIbI8v1L5L-ohJrwUghTfdv4S6vHx8WwvRm3WzcLYnfZGa7KTMw" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1042" data-original-width="1042" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEg-DuWVpSuRJiyFZfAAMoS8dXqJZpuWmK7u1ayApTaMXg_BRjuM91sG6u3H95PTj4PnCtwD18T886PWuyMB5gIEO8qqQK3is1d6VsD9CuTb91-78jUD_egS3h_cyAjO5pNHiqb2EjVjIbI8v1L5L-ohJrwUghTfdv4S6vHx8WwvRm3WzcLYnfZGa7KTMw" width="240" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chromaticity diagram of up-sampled color using relaxed [-0.1, 1.1] range</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
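<p>The chromaticity diagrams above are straightforward to reproduce: integrate each up-sampled spectrum against the CMF and project XYZ to xy. A minimal sketch (reusing the same kind of placeholder <i>cmf</i>/<i>illum</i> arrays as before):</p>
<pre>
import numpy as np

def chromaticity(spectrum, cmf, illum):
    """xy chromaticity of a reflectance spectrum under an illuminant:
    (x, y) = (X, Y) / (X + Y + Z)."""
    xyz = cmf.T @ (spectrum * illum)
    return xyz[:2] / np.sum(xyz)
</pre>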
<p></p><p><br /></p><p><span style="font-size: large;"><b>CMF Reference White</b></span></p><p>Up to this point, the calculation for the up-sampled color uses D65 as the reference white. But one day, I saw <a href="https://twitter.com/troy_s/status/1554567074281754624/photo/1">this tweet</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjay0Ag5vOzQK6f2IvvMRi6bQlojCh40G5UqKoHWfcuIzmdGi6t9d8DhMwpT_dnyN8aiN5zUoyJ73NBa3HEcR8PjP6TYCoJo7MKNQTll38TlzjNPo2RDfAAJOp3LlQfpGO2A2LqqPqkOfrfJ-JQwspzrsHqMgGYISQiOyA009ac8LaYwibOxOb89OFU_g" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="296" data-original-width="1751" height="109" src="https://blogger.googleusercontent.com/img/a/AVvXsEjay0Ag5vOzQK6f2IvvMRi6bQlojCh40G5UqKoHWfcuIzmdGi6t9d8DhMwpT_dnyN8aiN5zUoyJ73NBa3HEcR8PjP6TYCoJo7MKNQTll38TlzjNPo2RDfAAJOp3LlQfpGO2A2LqqPqkOfrfJ-JQwspzrsHqMgGYISQiOyA009ac8LaYwibOxOb89OFU_g" width="640" /></a></div><p>The CMF uses an <a href="http://yuhaozhu.com/blog/cmf.html">equal-energy white as its reference white</a>, so I was wondering whether all my calculations were wrong and whether I should add chromatic adaptation after CMF integration.<br /></p><p>So I decided to find spectral reflectance data for a color checker and integrate it with the CMF, to verify whether chromatic adaptation is needed after CMF integration. Using the color checker data found <a href="https://babelcolor.com/colorchecker-2.htm">here</a>, illuminating the grey patches with D65 and integrating the result with the CMF gives the following results (a sketch of this integration follows the figures below): <br /></p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgKGCBK9rJyA1trb3rSYgFPkAHqhRPXkbf_sStnsi_1_eWWUeFIj42Jgxpr9qQzrW8Vl63oIohEKVGeAjD-uujQ84U2Ijg62SL1XAAmaYTJ3Gwey62L2autr1JJK42XL60jWNDpqEvsfyM-DBgZKHiSpZTfBqYRsj4wHs2nWXkJwebQOEXxf0q2O0Pj9A" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="535" data-original-width="1307" height="164" src="https://blogger.googleusercontent.com/img/a/AVvXsEgKGCBK9rJyA1trb3rSYgFPkAHqhRPXkbf_sStnsi_1_eWWUeFIj42Jgxpr9qQzrW8Vl63oIohEKVGeAjD-uujQ84U2Ijg62SL1XAAmaYTJ3Gwey62L2autr1JJK42XL60jWNDpqEvsfyM-DBgZKHiSpZTfBqYRsj4wHs2nWXkJwebQOEXxf0q2O0Pj9A" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Illuminating grey patches with D65, integrate with CMF without CAT from Illuminant E </td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhZvdrQce7Y9t-E9o-LITLOi6ADJ5vniuFjDc-2lrWr2IhRLOtMB3K_HiqINdTeO7vO5aOuaKto5sV1zXzMf2WX_3IbW3q7cVV5zgXMYmzdQLzWHp9mZmlugjyHQNfZUYYCGbN_XacAwTdhH9KkXRRxpcaxfK3bItUGA10TVLun1syCgJXHyOc1Vy0qxg" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="139" data-original-width="255" height="109" src="https://blogger.googleusercontent.com/img/a/AVvXsEhZvdrQce7Y9t-E9o-LITLOi6ADJ5vniuFjDc-2lrWr2IhRLOtMB3K_HiqINdTeO7vO5aOuaKto5sV1zXzMf2WX_3IbW3q7cVV5zgXMYmzdQLzWHp9mZmlugjyHQNfZUYYCGbN_XacAwTdhH9KkXRRxpcaxfK3bItUGA10TVLun1syCgJXHyOc1Vy0qxg=w200-h109" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">sRGB value of measured Color Checker (2005)<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
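<p>The verification itself is just a per-patch product-and-sum. A sketch of the computation, assuming BabelColor's 380-730 nm, 10 nm sampling (the patch/illuminant/CMF arrays are placeholders, and the gamma encode is a rough approximation of the sRGB transfer function):</p>
<pre>
import numpy as np

N = 36                   # 380-730 nm at 10 nm steps
patch = np.ones(N)       # placeholder: measured grey patch reflectance
d65 = np.ones(N)         # placeholder: D65 SPD samples
cmf = np.ones((N, 3))    # placeholder: CMF samples (x, y, z columns)

# XYZ to linear sRGB (D65) matrix.
XYZ_TO_SRGB = np.array([[ 3.2404542, -1.5371385, -0.4985314],
                        [-0.9692660,  1.8760108,  0.0415560],
                        [ 0.0556434, -0.2040259,  1.0572252]])

# Integrate the lit patch against the CMF, normalized so that a perfect
# reflector has Y = 1, then convert straight to linear sRGB with NO
# chromatic adaptation from Illuminant E.
xyz = cmf.T @ (patch * d65) / (cmf[:, 1] @ d65)
rgb = np.clip(XYZ_TO_SRGB @ xyz, 0.0, 1.0) ** (1.0 / 2.2)  # rough encode
</pre>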
<p></p><p>Our computed sRGB values are very similar to the measured data, so it seems we don't need an extra chromatic adaptation step to adapt the color from the CMF's equal-energy reference white (please let me know if my math is incorrect).<br /></p><p><span style="font-size: large;"><b>Optimizing the up-sampling function with Color Checker data</b></span><br /></p><p>After working with the color checker data, I came up with the idea of modifying the spectral basis objective function to include a weight that biases it towards matching the Neutral 6.5 grey patch spectral reflectance data (a sketch of the extra objective term follows the figures below). With this, we get a decent match for the up-sampled spectral reflectance of the color checker grey patches (i.e. white 9.5, neutral 8, neutral 6.5, neutral 5, neutral 3.5, neutral 2). <br /></p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjXLERPJeG-M7nFzPazYq-r7Ypw8_tkI-SjKTDsGLWsyGk_A2cbi1JLOCGhOWVhOYx8aaO1pY3JKEC7RGzTwFcslBOYlWLWNgJzgRFlklXoI1VVcuP9VTLc2RtZKXg_3Vna4bAmwlarxJg9X6NJWtDyaMplfXS5HKUhsXQ5MVdFZuasSjsR1qPJZRUUzQ" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1201" data-original-width="2138" height="180" src="https://blogger.googleusercontent.com/img/a/AVvXsEjXLERPJeG-M7nFzPazYq-r7Ypw8_tkI-SjKTDsGLWsyGk_A2cbi1JLOCGhOWVhOYx8aaO1pY3JKEC7RGzTwFcslBOYlWLWNgJzgRFlklXoI1VVcuP9VTLc2RtZKXg_3Vna4bAmwlarxJg9X6NJWtDyaMplfXS5HKUhsXQ5MVdFZuasSjsR1qPJZRUUzQ" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Spectral Basis computed for Display-P3<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjW1UTdcDNufWdLK4ej7I_sQ3xt3aqPjImDHt6V2vfnTn2MoBo91CyPX9sBjKtImUG-Ucw99o6u72fK6XwnZ4QxOfNRoEzasd5sOOqM3FyVfcc7Eu8f6LqbeHl7usiXbleRnrY8FYFqf1kD02o_g_y8ZnqG-BSTgQPjg3shXJdUbg3sKC7yUVm0qSercQ" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1201" data-original-width="2138" height="180" src="https://blogger.googleusercontent.com/img/a/AVvXsEjW1UTdcDNufWdLK4ej7I_sQ3xt3aqPjImDHt6V2vfnTn2MoBo91CyPX9sBjKtImUG-Ucw99o6u72fK6XwnZ4QxOfNRoEzasd5sOOqM3FyVfcc7Eu8f6LqbeHl7usiXbleRnrY8FYFqf1kD02o_g_y8ZnqG-BSTgQPjg3shXJdUbg3sKC7yUVm0qSercQ" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Spectral Basis weighted with Neutral 6.5 color checker patch<br /></td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiaD8JFs6m2naRSAcOMu9xmotwpk4tk0wKAdk2sYzyO-tmBUvStEVoH9xGcR7WIyqvCMb9kbdbfRgLyAPD7mjcIa5YguxERDIvx5MkPG6dDxpmxfLJPf1kPwqfmszZiv9FSqogdeAAyrfOpGwHeODFAyvbgZoowIC_VXz-RyRoFfGUNUcjfT-Q4b_50Lg" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2144" height="178" src="https://blogger.googleusercontent.com/img/a/AVvXsEiaD8JFs6m2naRSAcOMu9xmotwpk4tk0wKAdk2sYzyO-tmBUvStEVoH9xGcR7WIyqvCMb9kbdbfRgLyAPD7mjcIa5YguxERDIvx5MkPG6dDxpmxfLJPf1kPwqfmszZiv9FSqogdeAAyrfOpGwHeODFAyvbgZoowIC_VXz-RyRoFfGUNUcjfT-Q4b_50Lg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled spectral reflectance of color checker grey patches using Spectral Basis computed for Display-P3</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhvoZSgoDURb_r8Qqd3bNDGh2kDCZDoVy6nuXvE2x9KKFBqH-tIeh-LgGCYfGSkchxIfZ_8IY3EVJtBVMAO02vhBN0tP9EL9fjQycewCZh8lps2_iXCfnyMGevxrfexDy94CQnFGP_RBIZdOo7n-a-0ddrAsV_U5stGGe3IwXwyVSTSE7NewAxtbRyn5g" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2144" height="178" src="https://blogger.googleusercontent.com/img/a/AVvXsEhvoZSgoDURb_r8Qqd3bNDGh2kDCZDoVy6nuXvE2x9KKFBqH-tIeh-LgGCYfGSkchxIfZ_8IY3EVJtBVMAO02vhBN0tP9EL9fjQycewCZh8lps2_iXCfnyMGevxrfexDy94CQnFGP_RBIZdOo7n-a-0ddrAsV_U5stGGe3IwXwyVSTSE7NewAxtbRyn5g" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Up-sampled spectral reflectance of color checker grey patches using Spectral Basis weighted with Neutral 6.5 patch data<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
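<p>In rough form, the extra term added to the objective looks like the following sketch; <i>neutral65</i> and its linear RGB value are placeholders for the measured Neutral 6.5 data, and the weight is an illustrative value, not the one I actually used:</p>
<pre>
import numpy as np

N = 36
neutral65 = np.ones(N)     # placeholder: measured Neutral 6.5 reflectance
neutral65_rgb = np.array([0.5, 0.5, 0.5])   # placeholder: its linear RGB
W_GREY = 5.0               # illustrative weight

def grey_patch_term(B):
    """Extra objective term: distance between the up-sampled spectrum of
    the Neutral 6.5 patch and its measured reflectance."""
    return W_GREY * np.sum((B @ neutral65_rgb - neutral65) ** 2)
</pre>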
<p></p><p>However, the up-sampled white color will have a slight round-trip error:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/color_checker_white_weight_6_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="123" data-original-width="800" height="99" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/studying_spectral_primary_decomposition/color_checker_white_weight_6_5.png" width="640" /></a></div><p></p><p><br /></p><p><span style="font-size: large;"><b>Conclusion</b></span></p><p>In this post, I ported the original "Spectral Primary Decomposition" source code to Octave and tried to modify it to up-sample Display-P3 colors, but the result is not very good. Also, within a game engine we usually have exposure and tone mapping adjustments, which affect the final pixel color, so I was wondering whether the up-sampling method should take those parameters into account. Doing so, however, would change the meaning of the texture colors so that they no longer match the PBR albedo textures. I will leave this for future investigation.</p><p><br /></p><p><b>References</b></p><p><span style="font-size: x-small;">[1] <a href="https://graphics.geometrian.com/research/spectral-primaries.html">https://graphics.geometrian.com/research/spectral-primaries.html</a></span></p><p><span style="font-size: x-small;">[2] <a href="http://yuhaozhu.com/blog/cmf.html">http://yuhaozhu.com/blog/cmf.html</a></span></p><p><span style="font-size: x-small;">[3] <a href="https://babelcolor.com/colorchecker-2.htm">https://babelcolor.com/colorchecker-2.htm</a></span><br /></p><p><br /></p><p></p><p><span style="font-size: large;"><b>Color Matching Function Comparison</b></span></p><p><b><span style="font-size: large;">Introduction</span></b></p><p>When performing spectral rendering, we need a Color Matching Function (CMF) to convert the spectral radiance to XYZ values, which are then converted to RGB for display. Different people perceive color slightly differently, and <a href="http://files.cie.co.at/873_CIE%20Research%20Strategy%20%28August%202016%29%20-%20Topic%206.pdf">age may also affect how colors are perceived</a>. So the <a href="https://en.wikipedia.org/wiki/International_Commission_on_Illumination">CIE</a> defines several standard observers for an average person. The commonly used CMFs are the CIE 1931 2° Standard Observer and the CIE 1964 10° Standard Observer. Besides these two, other CMFs also exist, such as the <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Similar_color_spaces">Judd and Vos modified CIE 1931 2° CMF</a> and the <a href="http://www.cvrl.org/ciexyzpr.htm">CIE 2006 CMF</a>. In this post, I will compare images rendered with different CMFs (as well as some analytical approximations). 
A demo can be downloaded <a href="https://drive.google.com/file/d/1l4RECRtmaFds_0Gwx5ccwQuYR8v7Pt_o/view?usp=sharing">here</a> (the demo renders using wavelengths between [380, 780] nm, which may introduce some error with CMFs that cover a larger range).<br /></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGeFfpdLxVzxBCvYuNLymOYcZNhdIP6JLcj9RUnaknXUZlaGtuQ0cJwt9p5OZ3vFLm5CLrczV-mgx-B_YYqU6utRBoAk4YGFtkV57D_u9Y2eHuW2UvcDEELz91GO30of9nTIoB-HW3_OHQ/s16000/main.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="694" data-original-width="1362" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGeFfpdLxVzxBCvYuNLymOYcZNhdIP6JLcj9RUnaknXUZlaGtuQ0cJwt9p5OZ3vFLm5CLrczV-mgx-B_YYqU6utRBoAk4YGFtkV57D_u9Y2eHuW2UvcDEELz91GO30of9nTIoB-HW3_OHQ/w640-h326/main.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Left: rendered with CIE2006 CMF<br />Right: rendered with CIE1931 CMF <br /></td></tr></tbody></table><p></p><p></p><p><b><span style="font-size: large;">CMF Luminance</span></b><br /></p><p>When I was implementing the different CMFs in my renderer, replacing the CMF directly resulted in slightly different brightness in the rendered images:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigEzQFi54bDLMPLQ72uMPRYPiqxZpWIg2jHjpqMto2dEf-oPaIJfeWRARm17Vb4bY8ZdVMARPEYkOk5ylSRWaRNY1t9I_EzlqzNq9tLA3HjBmR5eEVk-qfd4_kIV8bzLvEW87vHXkr5b7K/s16000/brightness_1931.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigEzQFi54bDLMPLQ72uMPRYPiqxZpWIg2jHjpqMto2dEf-oPaIJfeWRARm17Vb4bY8ZdVMARPEYkOk5ylSRWaRNY1t9I_EzlqzNq9tLA3HjBmR5eEVk-qfd4_kIV8bzLvEW87vHXkr5b7K/w320-h170/brightness_1931.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Rendered with 1931 CMF<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYuuQHand1P9csZdrYloEiyz0YHMY6-IlMAQq2I1b8gTE-rVZWXwo4Q9dsKnyZ4Hx-qFSAiSKf57SCenBDIoct8oj4b3SRnyjFcPKFK-t3B3i-fMFkVLo0Cch3nrm4n1d5uV_9qv7L-2T6/s16000/brightness_1964.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYuuQHand1P9csZdrYloEiyz0YHMY6-IlMAQq2I1b8gTE-rVZWXwo4Q9dsKnyZ4Hx-qFSAiSKf57SCenBDIoct8oj4b3SRnyjFcPKFK-t3B3i-fMFkVLo0Cch3nrm4n1d5uV_9qv7L-2T6/w320-h170/brightness_1964.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Rendered with 1964 CMF<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
<p>This is because the renderer uses photometric units (e.g. lumen, lux...) to define the brightness of the light sources. Since the definition of <a href="https://en.wikipedia.org/wiki/Luminous_energy">luminous energy</a> depends on the luminosity function (usually the <i>y(λ)</i> of the CMF), we need to calculate the intensity of the light source with respect to the chosen CMF (a sketch of this conversion is given after the figures below). Using the correct luminosity function, both rendered images have similar brightness:<br /></p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigEzQFi54bDLMPLQ72uMPRYPiqxZpWIg2jHjpqMto2dEf-oPaIJfeWRARm17Vb4bY8ZdVMARPEYkOk5ylSRWaRNY1t9I_EzlqzNq9tLA3HjBmR5eEVk-qfd4_kIV8bzLvEW87vHXkr5b7K/s16000/brightness_1931.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigEzQFi54bDLMPLQ72uMPRYPiqxZpWIg2jHjpqMto2dEf-oPaIJfeWRARm17Vb4bY8ZdVMARPEYkOk5ylSRWaRNY1t9I_EzlqzNq9tLA3HjBmR5eEVk-qfd4_kIV8bzLvEW87vHXkr5b7K/w320-h170/brightness_1931.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Rendered with 1931 CMF</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPkHAPSwGRqIuaBkJYQPzbHZAgIzNccT4koIw4-GIxJhBijRIGaJonrQsDWgWwpTV9EqFTR6m7SZDjFwPJpEUZfh4OGEhBW78vbAcBzTCzFthV2dY2wap4xGCAsyOpP6EPE25IbjJX5T8l/s16000/brightness_1964_fix.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPkHAPSwGRqIuaBkJYQPzbHZAgIzNccT4koIw4-GIxJhBijRIGaJonrQsDWgWwpTV9EqFTR6m7SZDjFwPJpEUZfh4OGEhBW78vbAcBzTCzFthV2dY2wap4xGCAsyOpP6EPE25IbjJX5T8l/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Rendered with 1964 CMF + luminance adjustment<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
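<p>Concretely, a light defined in lumens has to be converted to spectral radiant power using the <i>y(λ)</i> of whichever CMF is in use, so the conversion factor changes with the CMF. A sketch of the adjustment (683 lm/W is the standard luminous efficacy constant; the SPD and CMF arrays are placeholders):</p>
<pre>
import numpy as np

N = 41                  # e.g. 380-780 nm at 10 nm steps
spd = np.ones(N)        # placeholder: relative SPD of the light source
y_bar = np.ones(N)      # placeholder: y(lambda) of the chosen CMF
d_lambda = 10.0         # sample spacing in nm

def spd_for_lumens(lumens):
    """Scale the relative SPD so the light emits the requested luminous
    flux under the chosen CMF's luminosity function."""
    lm = 683.0 * np.sum(spd * y_bar) * d_lambda
    return spd * (lumens / lm)
</pre>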
<p> </p><p><span style="font-size: large;"><b>CMF White Point</b></span></p><p>When using different CMFs, the white points of the standard illuminants will be <a href="https://en.wikipedia.org/wiki/Standard_illuminant#White_points_of_standard_illuminants">slightly different</a>:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPigFFk2zsX_Ham6hj5d8289yRvBGZ5iBXIRJVMBSl5che8P0AqZ6kwDReskARLJfyJnZ07rIdVorpnML-jiePQBIWxIPQfN32cJh2bdYIxYLL4p4WtFwSsHXheiUu3aMRlw1QSdcy_IiF/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="327" data-original-width="1012" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPigFFk2zsX_Ham6hj5d8289yRvBGZ5iBXIRJVMBSl5che8P0AqZ6kwDReskARLJfyJnZ07rIdVorpnML-jiePQBIWxIPQfN32cJh2bdYIxYLL4p4WtFwSsHXheiUu3aMRlw1QSdcy_IiF/w640-h206/white_point.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">White point from wikipedia<br /></td></tr></tbody></table><p></p><p>Since we are dealing with game textures, where colors are usually defined in sRGB with a D65 white point, we need to find the white point of the D65 illuminant for each CMF tested in this post. Unfortunately, I could not find the D65 white point for the CIE 2006 CMF on the internet, so I calculated it myself (the
calculation steps can be found in the <a href="https://colab.research.google.com/drive/1EIqkm9-Y6Sl0VpiOBSgXM4XlUxutwmGz?usp=sharing">Colab source code</a>): </p><p></p><blockquote><p>CIE 2006 2° : (0.313453, 0.330802) </p><p>CIE 2006 10° : (0.313786, 0.331275) </p></blockquote><p></p><p>But when I rendered some images with and without chromatic adaptation, the results look similar: </p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw-UObfngj-QPoCHSExwkpX0Zxu215aARfF4cbD3ZklffMRkgshGlrIalA1v-bQsMOjwM97isg16bQrI8Min7ppGL9W8mvuvJQhCqUXVd-2lb9bTLiryGvBN05de7P4itimkX-tcpN3V4S/s16000/whitePt_1964_noCAT.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw-UObfngj-QPoCHSExwkpX0Zxu215aARfF4cbD3ZklffMRkgshGlrIalA1v-bQsMOjwM97isg16bQrI8Min7ppGL9W8mvuvJQhCqUXVd-2lb9bTLiryGvBN05de7P4itimkX-tcpN3V4S/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">1964 CMF without chromatic adaptation</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBmaoQYdxUZf3WtovCynVUFjCfcBB6saKQPFZK6zLLLrglGMFWmJjERfJ6iegllptPD_9B1q8XjosHwJENvt4Wv-vcXT5urXkjqz0I5h2kzeeF7f7iFcl63Ab26pe7Pm91FV9go-LUIix2/s16000/whitePt_1964_withCAT.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBmaoQYdxUZf3WtovCynVUFjCfcBB6saKQPFZK6zLLLrglGMFWmJjERfJ6iegllptPD_9B1q8XjosHwJENvt4Wv-vcXT5urXkjqz0I5h2kzeeF7f7iFcl63Ab26pe7Pm91FV9go-LUIix2/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">1964 CMF with chromatic adaptation</td></tr></tbody></table>
</td>
</tr>
</tbody></table><p>I searched the internet but could not find any information on whether we need to chromatically adapt the rendered image to account for the different white points of different CMFs... Maybe this is because the difference is so small that applying chromatic adaptation makes no visible difference. </p><p><br /></p><p><span style="font-size: large;"><b>CIE 2006 CMF analytical approximation</b></span><br /></p><p>The popular CIE 1931 and 1964 CMFs have simple analytical approximations, such as those in <a href="http://jcgt.org/published/0002/02/01/paper.pdf">"Simple Analytic Approximations to the CIE XYZ Color Matching Functions"</a> (which will be tested in this post). The newer CIE 2006 CMF lacks such an approximation, so I derived one using similar methods; the curve fitting process can be found in the <a href="https://colab.research.google.com/drive/1EIqkm9-Y6Sl0VpiOBSgXM4XlUxutwmGz?usp=sharing">Colab source code</a> (the general multi-lobe form is sketched after the figures below).<br /></p><p></p><p>2006 2° lobe approximation:</p><p>
</p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwbzrG4RDJL80n3FkQ5gC7yoLpgRl-BsV1hJs-KxQmC9zohL_DJA8tAp6_EyeaAhsKOg0JZEgJYE73hIMytOPyXB3MSaC-jRIN1qgLK-CqHs8lkOss149tIvLOVSNDyZP_tNIHfrnaaEkg/s16000/approx_CMF_2006_2.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="269" data-original-width="1004" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwbzrG4RDJL80n3FkQ5gC7yoLpgRl-BsV1hJs-KxQmC9zohL_DJA8tAp6_EyeaAhsKOg0JZEgJYE73hIMytOPyXB3MSaC-jRIN1qgLK-CqHs8lkOss149tIvLOVSNDyZP_tNIHfrnaaEkg/s16000/approx_CMF_2006_2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">2006 2° lobe approximation shader source code<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivQzQ31lYUyroNFPFEs_41byqEVgbiZtSzNhWoSP_Z02i1pAtVbggwreFvqCyNiNlLWfLeq9hEPbkkzn0Fk_E49bhQp2A0gBz3r59WNDI6Ag-tsrPtSrYgXjnkfd9NH1UvlPwSzXpgtR_R/s16000/CMF_2006_2.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="249" data-original-width="378" height="132" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivQzQ31lYUyroNFPFEs_41byqEVgbiZtSzNhWoSP_Z02i1pAtVbggwreFvqCyNiNlLWfLeq9hEPbkkzn0Fk_E49bhQp2A0gBz3r59WNDI6Ag-tsrPtSrYgXjnkfd9NH1UvlPwSzXpgtR_R/s16000/CMF_2006_2.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">black lines: exact 2006 2° CMF<br />color lines: approximated 2006 2° CMF<br /></td><td class="tr-caption" style="text-align: center;"> </td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table><p>2006 10° lobe approximation:</p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglncfgAB5q9Nmebi3xgPMwo9yV9YJr4n3HtR26a6aSxR-afuoZYR_YM0RjaM26btPD_7OV3pU4wdX-DCjbNOx7OUGOb5qHlnMJcs2gFVmNvN7ehxA0cNQyTA40yQNYAvu3Rm1jdfmDChdp/s16000/approx_CMF_2006_10.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="223" data-original-width="1004" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglncfgAB5q9Nmebi3xgPMwo9yV9YJr4n3HtR26a6aSxR-afuoZYR_YM0RjaM26btPD_7OV3pU4wdX-DCjbNOx7OUGOb5qHlnMJcs2gFVmNvN7ehxA0cNQyTA40yQNYAvu3Rm1jdfmDChdp/s16000/approx_CMF_2006_10.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">2006 10° lobe approximation shader source code</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEhuyJwCp-UaLpIZYzl5ik2Q8k0CqIDIbXw4O7cu51O4qaJBcCRoShwtFlyGPb3n5Y2DnpGyGZbirjSUTwDyjvdgd2CAM-z_g3fQ28kSdPuvdIvMN-kTFvTPTvMvcTaM9LHtS5Uj6a5yNQ/s16000/CMF_2006_10.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="248" data-original-width="372" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEhuyJwCp-UaLpIZYzl5ik2Q8k0CqIDIbXw4O7cu51O4qaJBcCRoShwtFlyGPb3n5Y2DnpGyGZbirjSUTwDyjvdgd2CAM-z_g3fQ28kSdPuvdIvMN-kTFvTPTvMvcTaM9LHtS5Uj6a5yNQ/s16000/CMF_2006_10.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">black lines: exact 2006 10° CMF<br />color lines: approximated 2006 10° CMF</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
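<p>For reference, the fits in the cited paper (and my 2006 fits above) are sums of piecewise Gaussian lobes. Below is the published multi-lobe CIE 1931 fit from Wyman, Sloan and Shirley as a sketch of the general form; my fitted CIE 2006 coefficients are in the screenshots above and in the Colab notebook, not reproduced here:</p>
<pre>
import numpy as np

def lobe(x, mu, s1, s2):
    """Piecewise Gaussian: sigma is s1 below the peak, s2 above it."""
    s = np.where(x >= mu, s2, s1)
    return np.exp(-0.5 * ((x - mu) / s) ** 2)

# Multi-lobe CIE 1931 fit from "Simple Analytic Approximations to the
# CIE XYZ Color Matching Functions" (Wyman, Sloan, Shirley 2013).
def xyz_1931_multi_lobe(lam):
    x = (1.056 * lobe(lam, 599.8, 37.9, 31.0)
         + 0.362 * lobe(lam, 442.0, 16.0, 26.7)
         - 0.065 * lobe(lam, 501.1, 20.4, 26.2))
    y = (0.821 * lobe(lam, 568.8, 46.9, 40.5)
         + 0.286 * lobe(lam, 530.9, 16.3, 31.1))
    z = (1.217 * lobe(lam, 437.0, 11.8, 36.0)
         + 0.681 * lobe(lam, 459.0, 26.0, 13.8))
    return x, y, z
</pre>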
<p><b><span style="font-size: large;">Saturated lights comparison</span></b></p><p>With the above changes to the path tracer, we can render some images for comparison. A scene with several saturated lights using the sRGB colors (1,0,0), (1,1,0), (0,1,0), (0,1,1), (0,0,1), (1,0,1) is tested (the light colors are spectrally up-sampled). 10 different CMFs are used:</p><ul style="text-align: left;"><li>CIE 1931 2° </li><li>CIE 1931 2° with Judd Vos adjustment</li><li>CIE 1931 2° single lobe analytic approximation</li><li>CIE 1931 2° multi lobe analytic approximation</li><li>CIE 1964 10° </li><li>CIE 1964 10° single lobe analytic approximation</li><li>CIE 2006 2°</li><li>CIE 2006 2° lobe analytic approximation</li><li>CIE 2006 10°</li><li>CIE 2006 10° lobe analytic approximation</li></ul><p>Here are the results:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg55au6DuK5e8YR3YbTASCFbtpcYnyAk6J4909IV6NpRcNj6sPNYYe8DfTKOUiImGNle9vYBfXfMaaHfiNLK-xSJv4iTryXun0-kphv6KIfL4VCammAs4CRS6X1FHQi9p9PC1nAA9RPk1Gk/s16000/saturated_lights_1931.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg55au6DuK5e8YR3YbTASCFbtpcYnyAk6J4909IV6NpRcNj6sPNYYe8DfTKOUiImGNle9vYBfXfMaaHfiNLK-xSJv4iTryXun0-kphv6KIfL4VCammAs4CRS6X1FHQi9p9PC1nAA9RPk1Gk/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7OKo-HlXea3dDmYcBdflaM-IUvrGYvFEvLMv8Z2AWZsWsOutQyWtYkebRzd-XZamIRJ-pEY_lyuzQL0Q_s_PiUFAZZhhQJI1rqsU53qRxXPg42vBXK_G-JQ521E3-q8zBXzZ0s6qwU3rq/s16000/saturated_lights_1931_jv.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7OKo-HlXea3dDmYcBdflaM-IUvrGYvFEvLMv8Z2AWZsWsOutQyWtYkebRzd-XZamIRJ-pEY_lyuzQL0Q_s_PiUFAZZhhQJI1rqsU53qRxXPg42vBXK_G-JQ521E3-q8zBXzZ0s6qwU3rq/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° with Judd Vos adjustment</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaOlppOb-kwEd84kMRYmRjT-fMZmeGenQQ_BwYT5VmofjYRDwwoHDBAZd3XZQonSx-64TNUaoUjC2bLcPlm6EKCzcvOyAmGk1-Kczuoe2heE-II1kAMrR2nIRHx2ROguDIeRMlFPX5fgz3/s16000/saturated_lights_1931_single_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaOlppOb-kwEd84kMRYmRjT-fMZmeGenQQ_BwYT5VmofjYRDwwoHDBAZd3XZQonSx-64TNUaoUjC2bLcPlm6EKCzcvOyAmGk1-Kczuoe2heE-II1kAMrR2nIRHx2ROguDIeRMlFPX5fgz3/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° single lobe analytic approximation</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_r76f6QdZXYB1xVLYbaCAyymBtDqBwSjhW-TdcoKRPqSFijQZQ51lnCrDizOYNqNmp7F7wMR8C4OL4fAi4kicGxTGF1bx4BGAapXM7rLoUeaZgbxV0IDExJ0LyA7dDdRXNxxa9eSuC-JO/s16000/saturated_lights_1931_multi_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_r76f6QdZXYB1xVLYbaCAyymBtDqBwSjhW-TdcoKRPqSFijQZQ51lnCrDizOYNqNmp7F7wMR8C4OL4fAi4kicGxTGF1bx4BGAapXM7rLoUeaZgbxV0IDExJ0LyA7dDdRXNxxa9eSuC-JO/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° multi lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF5wOcEKB1-ZaEIpuYiIGzqZUK8mD5jfWXtbJ28Y7Qe242gsYrvA6h8LJm-IFzVnrZbYpfQ1AaKEhxrwFYyMse7v7I6MLCjf8nz-Xiu2bdDc65DukXjmU4gueKVMEMXgUjRMPnA6eSwhM3/s16000/saturated_lights_1964.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF5wOcEKB1-ZaEIpuYiIGzqZUK8mD5jfWXtbJ28Y7Qe242gsYrvA6h8LJm-IFzVnrZbYpfQ1AaKEhxrwFYyMse7v7I6MLCjf8nz-Xiu2bdDc65DukXjmU4gueKVMEMXgUjRMPnA6eSwhM3/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1964 10°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbU6TtSf-JYS6tZ0b4ZLl1oePXAPG-U3QEfEosCV7qEQ5uPPsV9jc1w934mecumEXUTcdwo-9PGE4k9R5hcMph1rbzv8BWPW1ATiuav0vtysMg5Xvwt5lPPZpiV7QoEt2RK6hZOKcccj6q/s16000/saturated_lights_1964_single_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbU6TtSf-JYS6tZ0b4ZLl1oePXAPG-U3QEfEosCV7qEQ5uPPsV9jc1w934mecumEXUTcdwo-9PGE4k9R5hcMph1rbzv8BWPW1ATiuav0vtysMg5Xvwt5lPPZpiV7QoEt2RK6hZOKcccj6q/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1964 10° single lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjatB6phdpV33tD4ub_5Pv3j2Q53Kr1jFIElV_H2QL8KJ2J_07SKB0OW5xjFhN-fqfG8y8DyYaQrlXrHMq_LJ_Hy7qYhxgAzGNwhuvUgdQxORXaiagytuZx9ukySH7JVNm-QXZCtkYRNOq2/s16000/saturated_lights_2006_2.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjatB6phdpV33tD4ub_5Pv3j2Q53Kr1jFIElV_H2QL8KJ2J_07SKB0OW5xjFhN-fqfG8y8DyYaQrlXrHMq_LJ_Hy7qYhxgAzGNwhuvUgdQxORXaiagytuZx9ukySH7JVNm-QXZCtkYRNOq2/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 2°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidQs2cCLNPfq75qKhbHXJdX50gzi9ICLdjelpqdInIA91R4R851G6K-7inDT8WCSxhuAy3t_pHLJ48gmPjOJWyd09wuwh6_UKadMZv1sew4sKv4REmZ64oJxB8fTXT5W6NewUdSCSAq-3h/s16000/saturated_lights_2006_2_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidQs2cCLNPfq75qKhbHXJdX50gzi9ICLdjelpqdInIA91R4R851G6K-7inDT8WCSxhuAy3t_pHLJ48gmPjOJWyd09wuwh6_UKadMZv1sew4sKv4REmZ64oJxB8fTXT5W6NewUdSCSAq-3h/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 2° lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkuczkJI3rR5v3MbETG9h-g7rFeHpGHAhnT3nFRFslp6rpO1NbwVL-DTgp9SQzxXf1kUAUnUkbtjmc2-hMVlA4lkhueCXl23bcpbrOMPg2DYjGzB-UIl2HiBVTMTfg4yD2LWzRGbSko1-m/s16000/saturated_lights_2006_10.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkuczkJI3rR5v3MbETG9h-g7rFeHpGHAhnT3nFRFslp6rpO1NbwVL-DTgp9SQzxXf1kUAUnUkbtjmc2-hMVlA4lkhueCXl23bcpbrOMPg2DYjGzB-UIl2HiBVTMTfg4yD2LWzRGbSko1-m/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 10°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiIB5I4sQrc7ZQ2iD3qklfpqtWkSt_JTlTtpT9ITVbAM3v9hAjw1OrriGwyTL_-K-KYN7LQcJN-RqghDBVHZPF5ZuC7aaCwZ79Svi94XkkSD13sAVzsF4Fe1H8OlDRwHQ47JsK4v4Eh22u/s16000/saturated_lights_2006_10_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiIB5I4sQrc7ZQ2iD3qklfpqtWkSt_JTlTtpT9ITVbAM3v9hAjw1OrriGwyTL_-K-KYN7LQcJN-RqghDBVHZPF5ZuC7aaCwZ79Svi94XkkSD13sAVzsF4Fe1H8OlDRwHQ47JsK4v4Eh22u/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 10° lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
<p></p><p></p><p>From <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Similar_color_spaces">Wikipedia</a>:</p><p></p><blockquote>"<i>The CIE 1931 CMF is known to underestimate the contribution of the shorter blue wavelengths.</i>"</blockquote><p></p><p>So I was expecting some variation in the blue color when using different CMFs. But to my surprise, only the CIE 1931 CMF suffers from the <strike><i><a href="http://www.brucelindbloom.com/index.html?UPLab.html">“Blue Turns Purple” Problem</a></i></strike> (Edited: As pointed out by <a href="https://twitter.com/troy_s/status/1421837070348087298">troy_s on twitter</a>, the reference I provided was wrong; the link talks about a psychophysical effect, while the current issue is a mishandling of light data) which we encountered in <a href="https://simonstechblog.blogspot.com/2020/03/dxr-path-tracer.html">previous</a> <a href="https://simonstechblog.blogspot.com/2021/06/implementing-gamut-mapping.html">posts</a> (i.e. a saturated sRGB blue light renders as purple). Originally, after the previous blog post, I investigated this issue and suspected that the ACES tone mapper caused the color shift (as this issue does not happen when rendering in the narrow sRGB gamut with a Reinhard tone mapper). I thought maybe we could use the OKLab color space to get the hue value before tone mapping, and tone map only the lightness to keep the blue color (a sketch of this hue check is given after the figures below). But when I tried this approach, the hue obtained before tone mapping was still purple, which suggests it may not be the tone mapper causing the issue (or somehow my method of getting the hue from an HDR value is wrong...). Having no idea how to solve the issue, I randomly toggled some debug view modes and accidentally found that some of the purple colors are actually inside my AdobeRGB monitor's display gamut (but outside the sRGB gamut on another monitor...), so the problem is not only caused by out-of-gamut colors producing the purple shift... </p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/color_matching_function_comparison/blue_purple.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="2090" data-original-width="3807" height="176" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/color_matching_function_comparison/blue_purple.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The purple color on the wall is within displayable Adobe RGB gamut<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/color_matching_function_comparison/blue_purple_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="2090" data-original-width="3807" height="176" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/color_matching_function_comparison/blue_purple_outOfGamut.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Highlighting out of gamut pixel with cyan color<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
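<p>For completeness, here is the hue check mentioned above: linear sRGB to OKLab (matrices from Björn Ottosson's OKLab reference implementation), with the hue taken as the angle of (a, b). This is a sketch of my debugging approach, not renderer code:</p>
<pre>
import numpy as np

# Linear sRGB -> LMS and non-linear LMS -> OKLab matrices, from the
# OKLab reference implementation.
SRGB_TO_LMS = np.array([[0.4122214708, 0.5363325363, 0.0514459929],
                        [0.2119034982, 0.6806995451, 0.1073969566],
                        [0.0883024619, 0.2817188376, 0.6299787005]])
LMS_TO_OKLAB = np.array([[0.2104542553,  0.7936177850, -0.0040720468],
                         [1.9779984951, -2.4285922050,  0.4505937099],
                         [0.0259040371,  0.7827717662, -0.8086757660]])

def oklab_hue(rgb_linear):
    """Hue angle (radians) of a linear sRGB color in OKLab; also works
    on HDR values since OKLab has no built-in range limit."""
    lms = SRGB_TO_LMS @ np.asarray(rgb_linear, dtype=float)
    lab = LMS_TO_OKLAB @ np.cbrt(lms)
    return np.arctan2(lab[2], lab[1])   # atan2(b, a)
</pre>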
<p>So I decided to investigate the problem for the spectral renderer first (and ignore the RGB renderer), which is why I tested different CMFs in this blog post. (As a side note, the behavior of the blue-turns-purple problem is a bit different between the RGB and spectral renderers: using a more saturated blue, e.g. (0, 0, 1) in Rec. 2020, can hide the issue in the RGB renderer, while the same more saturated blue in the 1931 CMF spectral renderer still suffers from the problem; the other CMFs do not have this issue.)<br /></p><p> </p><p><span style="font-size: large;"><b>Color Checker comparison</b></span></p><p>Next, we compare a color checker lit by a white light source. Since my spectral renderer needs to maintain compatibility with RGB rendering, and I was too lazy to implement spectral materials using measured spectral reflectance, both the color checker and the light source are up-sampled from sRGB colors.</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirssAxD1u84ThAW4cs2Mgl2LFF5wucByqWCz1gh0wQQ4zgS956hbY0MyreGdMx00P3QlXTMweLKewtULCcMLYVlMNbrJgEJUkEZJl2j6ot8JrCz-62CGhqP6lXhxTTOrc6TUW-0nRM-6dd/s16000/color_checker_1931.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirssAxD1u84ThAW4cs2Mgl2LFF5wucByqWCz1gh0wQQ4zgS956hbY0MyreGdMx00P3QlXTMweLKewtULCcMLYVlMNbrJgEJUkEZJl2j6ot8JrCz-62CGhqP6lXhxTTOrc6TUW-0nRM-6dd/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmEih2M4jE7W1Iniz3ZqGlD1wUeniiDDt3nDyQzRwBd5K41ulVE83OL7HaGGvPG1P81L2NSb_Vofsm9RLRVpzdnQDstTZA0PmypYGTPcIZyxRr3G7mieOcRRhJpI5QtQ-N39atSPqo1TFL/s16000/color_checker_1931_jv.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmEih2M4jE7W1Iniz3ZqGlD1wUeniiDDt3nDyQzRwBd5K41ulVE83OL7HaGGvPG1P81L2NSb_Vofsm9RLRVpzdnQDstTZA0PmypYGTPcIZyxRr3G7mieOcRRhJpI5QtQ-N39atSPqo1TFL/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° with Judd Vos adjustment</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmykeOaSo4d4KosMLJl8M98I390HyC3_-Tg-tIFX-739LBCAluRe8ScOYbX5JWXGvEASKUkrrBdOQO8VV61DAM7095Jb_wZmUcpWP_V-cVdEfxLBsfrCyZpXdCl417lGciRBIcTV-4S-45/s16000/color_checker_1931_singleLobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmykeOaSo4d4KosMLJl8M98I390HyC3_-Tg-tIFX-739LBCAluRe8ScOYbX5JWXGvEASKUkrrBdOQO8VV61DAM7095Jb_wZmUcpWP_V-cVdEfxLBsfrCyZpXdCl417lGciRBIcTV-4S-45/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° single lobe analytic approximation</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbWNlrNYynFiX1hIhQ1cltTws0ReJnN5kXxgGmZQrOqEOYO7PFYke7yyRCiSfvVonuVxDA6bOOwS9zDae0vCNiHzvQ43UKxzkPyQCmFrdbdb7SZnDbztXXT77-0EDv7yTQvWqVuendjIyD/s16000/color_checker_1931_multiLobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbWNlrNYynFiX1hIhQ1cltTws0ReJnN5kXxgGmZQrOqEOYO7PFYke7yyRCiSfvVonuVxDA6bOOwS9zDae0vCNiHzvQ43UKxzkPyQCmFrdbdb7SZnDbztXXT77-0EDv7yTQvWqVuendjIyD/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1931 2° multi lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVf-l-n4te3MLOkIfoIBw3lDkAbPXFqEzHy0CmeDrlxjBMM1KhYjqZwRoiYd7d07QBqA8qIeS0tDPUpLlXnwyVCgmuPcBtLlIsRjQnCxvlbD0zUBGn8AxZPrQkT4d-ga-PuH-y5Zec7xBX/s16000/color_checker_1964.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVf-l-n4te3MLOkIfoIBw3lDkAbPXFqEzHy0CmeDrlxjBMM1KhYjqZwRoiYd7d07QBqA8qIeS0tDPUpLlXnwyVCgmuPcBtLlIsRjQnCxvlbD0zUBGn8AxZPrQkT4d-ga-PuH-y5Zec7xBX/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1964 10°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5F5-ZXmDA7knmeVZbyq3_h3CfS79VIyI284XsTqrKb9v_oiIpBah0XCuO7eLobL4UPGy3Fyoy9TZvhCEqoL4S2h3nEQsBG65HF5Pme99sjvRBRSzTdHaowIW7ksc_v8ShlKg3XNONxfvV/s16000/color_checker_1964_singleLobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5F5-ZXmDA7knmeVZbyq3_h3CfS79VIyI284XsTqrKb9v_oiIpBah0XCuO7eLobL4UPGy3Fyoy9TZvhCEqoL4S2h3nEQsBG65HF5Pme99sjvRBRSzTdHaowIW7ksc_v8ShlKg3XNONxfvV/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 1964 10° single lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVNkEeK9p-ZVl3wO5QYDIFJfwgTvF_Dk2x-4P8SpN7dsmD0vQSuJPh88P0E1yod1o7QQkwNeoqY9pGttT59-8CXbEP6teDVEHO1HEt-H7k5OerqF9ySq0NxXzAJPJUgPGGuzqnK_Lksa-S/s16000/color_checker_2006_2.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVNkEeK9p-ZVl3wO5QYDIFJfwgTvF_Dk2x-4P8SpN7dsmD0vQSuJPh88P0E1yod1o7QQkwNeoqY9pGttT59-8CXbEP6teDVEHO1HEt-H7k5OerqF9ySq0NxXzAJPJUgPGGuzqnK_Lksa-S/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 2°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2be9Jlzeai-tkSs4cmr7KGF83dqozsI65I-eTcNR5334NxE4hkOThbkyS01PuXVCnKgzQy8DptBadmdpU8pPILru91SGvymrloq1BBTfmPhvAKXAtnwPpQ7WnjEEmsklZ6wUkfgmTSytb/s16000/color_checker_2006_2_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2be9Jlzeai-tkSs4cmr7KGF83dqozsI65I-eTcNR5334NxE4hkOThbkyS01PuXVCnKgzQy8DptBadmdpU8pPILru91SGvymrloq1BBTfmPhvAKXAtnwPpQ7WnjEEmsklZ6wUkfgmTSytb/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 2° lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi72LT1j2UAw-J3nj65LjqfKBLOGSAJj4CY05u5EkhGZKPPRSgCMXuYOWoW3bsBzJ6Suu4zGYofoGkSn_VnusVL46kahNrSwbzD2_A2hO59ZHNPWmtNpwJCLRuqNwP2EBkQ-3HSdp8OXTYy/s16000/color_checker_2006_10.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi72LT1j2UAw-J3nj65LjqfKBLOGSAJj4CY05u5EkhGZKPPRSgCMXuYOWoW3bsBzJ6Suu4zGYofoGkSn_VnusVL46kahNrSwbzD2_A2hO59ZHNPWmtNpwJCLRuqNwP2EBkQ-3HSdp8OXTYy/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 10°</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi480rEzjNal5enHLBSwF_PwJT1HrE5OIRp-LbKUsp1ujrA97LiCOoYQauJRFycZf7mQbtf0nuDKmeXnfqJPRw9PnjeRt4Xegs_FRlCk6UjGDTrsx4RavTggKKNUyv2vy3LSY0cCqtgoFC2/s16000/color_checker_2006_10_lobe.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi480rEzjNal5enHLBSwF_PwJT1HrE5OIRp-LbKUsp1ujrA97LiCOoYQauJRFycZf7mQbtf0nuDKmeXnfqJPRw9PnjeRt4Xegs_FRlCk6UjGDTrsx4RavTggKKNUyv2vy3LSY0cCqtgoFC2/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CIE 2006 10° lobe analytic approximation</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
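<p>As a side note, the D65 white point for the CIE 2006 CMFs mentioned in the Conclusion below can be computed as in this minimal sketch (Python): integrate the CMF against the illuminant SPD and normalize the luminance. Both arrays are assumed to be resampled onto the same wavelength grid; the exact math I used lives in the Colab source code:</p>
<pre>
# Minimal sketch: compute a white point (e.g. D65 under the CIE 2006 CMFs).
import numpy as np

def white_point_xyz(cmf, spd):
    """cmf: (N, 3) x/y/z-bar values; spd: (N,) illuminant power distribution.
    Returns XYZ normalized so that Y = 1 (the grid spacing cancels out)."""
    XYZ = cmf.T @ spd    # rectangle-rule integration over wavelength
    return XYZ / XYZ[1]
</pre>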
<p></p><p>From the above results, the different CMFs have a similar look except for the blue colors.<br /></p><p></p><p><br /></p><p><span style="font-size: large;"><b>Conclusion</b></span></p><p>In this post, we have compared different CMFs, provided an analytical approximation for the CIE 2006 CMFs and calculated the D65 white point for the CIE 2006 CMFs (the math can be found in the <a href="https://colab.research.google.com/drive/1EIqkm9-Y6Sl0VpiOBSgXM4XlUxutwmGz?usp=sharing">Colab source code</a>). All the CMFs produce similar colors except for blue: CMFs newer than the 1931 CMF can render a saturated blue correctly without turning it into purple. Maybe we should use a newer CMF instead, especially when working with wide gamut colors. Konica Minolta also points out that the <a href="https://sensing.konicaminolta.asia/deficiencies-of-the-cie-1931-color-matching-functions/">CIE 1931 CMF has issues with the wider color gamut of OLED displays</a> (and suggests using the CIE 2015 CMF instead). But sadly, I cannot find the data for the CIE 2015 CMF, so it is not tested in this post.<br /></p><p><br /></p><p><b>Reference</b></p><p>[1] <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space">https://en.wikipedia.org/wiki/CIE_1931_color_space</a></p><p>[2] <a href="http://cvrl.ioo.ucl.ac.uk/">http://cvrl.ioo.ucl.ac.uk/</a></p><p>[3] <a href="http://jcgt.org/published/0002/02/01/paper.pdf">http://jcgt.org/published/0002/02/01/paper.pdf</a></p><p>[4] <a href="https://en.wikipedia.org/wiki/ColorChecker">https://en.wikipedia.org/wiki/ColorChecker</a></p><p>[5] <a href="https://en.wikipedia.org/wiki/Standard_illuminant">https://en.wikipedia.org/wiki/Standard_illuminant</a></p><p>[6] <a href="https://www.rit.edu/cos/colorscience/rc_useful_data.php">https://www.rit.edu/cos/colorscience/rc_useful_data.php</a></p><p>[7] <a href="https://sensing.konicaminolta.asia/deficiencies-of-the-cie-1931-color-matching-functions/">https://sensing.konicaminolta.asia/deficiencies-of-the-cie-1931-color-matching-functions/</a></p><p></p><p></p>Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-88397538796283031352021-06-26T14:42:00.006+08:002021-06-26T15:02:21.081+08:00Implementing Gamut Mapping<p><span style="font-size: large;"><b>Introduction</b></span></p><p>Continuing from the <a href="https://simonstechblog.blogspot.com/2021/05/studying-gamut-clipping.html">previous post</a>: after learning how gamut clipping works, I wanted to see how it behaves in rendered images, so I implemented it in my toy path tracer with clipping to an arbitrary gamut. It can be downloaded <a href="https://drive.google.com/file/d/1R3mGRkG8T8reNRthwXr1XTh1JzMvjFtD/view?usp=sharing">here</a>. Also, the <a href="https://www.shadertoy.com/view/7tlGRS">Shadertoy sample</a> is updated to support clipping to an arbitrary gamut.</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsrG3b734OoWcXUMiWdKgID07sx6taoy7hembiAtDhJx8Ykj-b5LB-wqlJ-WAiFp23om70meGY7ji982AUIByXjANl0KNT8hUXfArIocPMdOgMcKhaWclSwY8xUwqRoQGGqgf_SM3O_aEM/s16000/main_gamut_clip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="694" data-original-width="1362" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsrG3b734OoWcXUMiWdKgID07sx6taoy7hembiAtDhJx8Ykj-b5LB-wqlJ-WAiFp23om70meGY7ji982AUIByXjANl0KNT8hUXfArIocPMdOgMcKhaWclSwY8xUwqRoQGGqgf_SM3O_aEM/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">With gamut clipping<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxL5cIned761r0PsEjegftd783eTwx0Uetw7tZqp0zgyALsy8MwuYDPMwAL2niM4jlduM8PqjtoT_W0Oc0H-5wjpbAZDDkL2I3UfuRKzsp-6dtep-Qew35abZh3ca8gf3GtiDIjCpi3Or2/s16000/main_no_clip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="694" data-original-width="1362" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxL5cIned761r0PsEjegftd783eTwx0Uetw7tZqp0zgyALsy8MwuYDPMwAL2niM4jlduM8PqjtoT_W0Oc0H-5wjpbAZDDkL2I3UfuRKzsp-6dtep-Qew35abZh3ca8gf3GtiDIjCpi3Or2/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
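<p>The comparisons in this post include an "Out of gamut pixels" debug view; a minimal sketch (Python) of one way to produce such a mask is shown here. This is an assumed, generic implementation, not necessarily the exact code in my path tracer:</p>
<pre>
# Minimal sketch: flag pixels with any channel outside [0, 1] after the
# image has been converted to the target gamut's primaries.
import numpy as np

def out_of_gamut_mask(rgb, eps=1e-4):
    """rgb: (H, W, 3) linear values in the target gamut."""
    # a channel c is in gamut iff it stays within 0.5 of 0.5
    return np.any(np.abs(rgb - 0.5) > 0.5 + eps, axis=-1)

def highlight_out_of_gamut(rgb, tint=(0.0, 1.0, 1.0)):
    out = np.clip(rgb, 0.0, 1.0)
    out[out_of_gamut_mask(rgb)] = tint   # tint flagged pixels, e.g. cyan
    return out
</pre>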
<br /><p><span style="font-size: large;"><b>Solving max saturation analytically</b></span><br /></p><p>We need to compute the maximum saturation to perform gamut clipping. In the <a href="https://bottosson.github.io/posts/gamutclipping/">original gamut clipping blog post</a>, the author relies on fitting a polynomial function for the sRGB max saturation. But since my path tracer can output to different color gamuts (e.g. Adobe RGB, P3 D65...), and I was too lazy to write such a curve-fitting function for an arbitrary gamut, I took a look at how the max saturation polynomial function is derived in the original <a href="https://colab.research.google.com/drive/1JdXHhEyjjEE--19ZPH1bZV_LiGQBndzs?usp=sharing">Colab source code</a>:</p><div class="separator" style="clear: both; text-align: center;"><span><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGwtJNn6WvZt6uibre6-nYu6GwjmWnB8miylqvc9OqokTF9KsKUzHRSYtO-EyTGEBFAvnzn0yz7LC2xxFG7LIAwh04oI87lKZkGJLv0vJ-lOi_07AuIduAf-y27hcsJU-J0YJxJKiReMoZ/s16000/solve_max_sat_from_def.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="656" data-original-width="2092" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGwtJNn6WvZt6uibre6-nYu6GwjmWnB8miylqvc9OqokTF9KsKUzHRSYtO-EyTGEBFAvnzn0yz7LC2xxFG7LIAwh04oI87lKZkGJLv0vJ-lOi_07AuIduAf-y27hcsJU-J0YJxJKiReMoZ/w640-h200/solve_max_sat_from_def.png" width="640" /></a></span></div><p></p><p>Luckily, optimizing the <b>e_R()</b> / <b>e_G()</b> / <b>e_B()</b> functions to 0 is equivalent to solving the equation <b>to_R()</b> / <b>to_G()</b> / <b>to_B()</b> = 0, which is a cubic function with an analytical solution: </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgghfB7Pjm-YQdvZmRWX0XFAh4_f4Bp9EGnLiLosWHrB7vHEZIgCjEYrihN3zqw3NWgKCmsvjSsUmRaFzZ6g4maYn8pdJF9lLdlr1K6EYG3dCS8bfnaq-eEUFp7siTfW7F3ZKunQUYrDpfC/s16000/solve_max_sat.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1024" data-original-width="1570" height="418" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgghfB7Pjm-YQdvZmRWX0XFAh4_f4Bp9EGnLiLosWHrB7vHEZIgCjEYrihN3zqw3NWgKCmsvjSsUmRaFzZ6g4maYn8pdJF9lLdlr1K6EYG3dCS8bfnaq-eEUFp7siTfW7F3ZKunQUYrDpfC/w640-h418/solve_max_sat.png" width="640" /></a></div><p></p><p>To calculate the max saturation for an arbitrary gamut, we can first compute the <b>r_dir</b> / <b>g_dir</b> / <b>b_dir</b> for our target gamut, then compute the <b><i>Oklab to target gamut matrix</i></b>, and finally solve the cubic equation to get the maximum saturation. Details can be found in the <a href="https://www.shadertoy.com/view/7tlGRS">Shadertoy sample code</a>.<br /></p><p>But solving this cubic equation analytically has some precision issues at hue values around the blue color, so the <a href="https://www.shadertoy.com/view/7tlGRS">Shadertoy demo</a> performs a step of Halley's method to minimize the issue. If the target clipping gamut is not large (e.g. sRGB, AdobeRGB...), solving the cubic equation with a numerical method (e.g. 1 step of Halley's method + 1 step of Newton's method) from a good initial guess (e.g. I have tried 0.4 in the <a href="https://www.shadertoy.com/view/7tlGRS">Shadertoy demo</a>) may be enough, and is more stable numerically. A minimal sketch of this numerical approach is shown below.</p>
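<p>A minimal sketch (Python), using the sRGB gamut with the red channel as the limiting one as an example: the <b>k_l</b> / <b>k_m</b> / <b>k_s</b> values come from the published OKLab definition, and <b>wl</b> / <b>wm</b> / <b>ws</b> is the red row of the nonlinear-LMS to linear sRGB matrix. For another gamut, the corresponding row of that gamut's matrix would be swapped in; my actual Shadertoy code differs in its details:</p>
<pre>
# Minimal sketch: solve to_R(S) = 0 with Halley's method from a fixed guess.
def max_saturation_red_srgb(a, b, s_init=0.4, iters=2):
    """(a, b): normalized OKLab hue direction, i.e. a*a + b*b == 1."""
    k_l =  0.3963377774 * a + 0.2158037573 * b
    k_m = -0.1055613458 * a - 0.0638541728 * b
    k_s = -0.0894841775 * a - 1.2914855480 * b
    wl, wm, ws = 4.0767416621, -3.3077115913, 0.2309699292
    S = s_init
    for _ in range(iters):
        l_, m_, s_ = 1.0 + S * k_l, 1.0 + S * k_m, 1.0 + S * k_s
        f  = wl * l_**3 + wm * m_**3 + ws * s_**3                     # to_R(S)
        f1 = 3.0 * (wl * k_l * l_**2 + wm * k_m * m_**2 + ws * k_s * s_**2)
        f2 = 6.0 * (wl * k_l**2 * l_ + wm * k_m**2 * m_ + ws * k_s**2 * s_)
        S = S - f * f1 / (f1 * f1 - 0.5 * f * f2)                     # Halley step
    return S
</pre>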
<p><span></span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinFQMVqwL7SI9xoQVWSDcjrrC0-5e5S0TUd_5OY48g-cGxPTqgwZQym3dzWaJ99x43jnIB449zvAMRwsYvHajKFvK_d8Z4RE79vhPSGJ9Ha8rJQ_b_7_kaaj3jUU1e86s2QXpACs3Decbe/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="473" data-original-width="1611" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinFQMVqwL7SI9xoQVWSDcjrrC0-5e5S0TUd_5OY48g-cGxPTqgwZQym3dzWaJ99x43jnIB449zvAMRwsYvHajKFvK_d8Z4RE79vhPSGJ9Ha8rJQ_b_7_kaaj3jUU1e86s2QXpACs3Decbe/w640-h189/precision_error.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The left image shows the precision error when calculating the cusp point at hue 232.58 degrees<br />The right image calculates the cusp point correctly, with less than 1 degree hue difference from the left image<br /></td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div><p></p><p> </p><span style="font-size: large;"><b>Solving RGB=1 clipping line with 2 curves only</b></span><p><a href="https://simonstechblog.blogspot.com/2021/05/studying-gamut-clipping.html">From the previous post</a>, we know that each upper clipping line of the valid gamut "triangle" is a line where the Red/Green/Blue value = 1, and at most 2 clipping lines are used:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqynmMn9qPBGE0VZR2cE5Au3jvcNOsWGh5d0V486mjx5-UII1BfandT-_gLTf_1t19SyFo2BvArB08tYVA9IpPhoIv7r6uJ0b_7MJyD48IUsVrbURYmnaxCXsD4RgJkAf_EuAPZG2vfPNj/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="470" data-original-width="775" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqynmMn9qPBGE0VZR2cE5Au3jvcNOsWGh5d0V486mjx5-UII1BfandT-_gLTf_1t19SyFo2BvArB08tYVA9IpPhoIv7r6uJ0b_7MJyD48IUsVrbURYmnaxCXsD4RgJkAf_EuAPZG2vfPNj/w400-h243/rgb_one_clip_line_x2.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">This yellow hue uses 2 upper clipping lines (red and green lines)<br /></td></tr></tbody></table><p></p><p>In the updated <a href="https://www.shadertoy.com/view/7tlGRS">Shadertoy demo</a>, the upper "triangle" clipping method is changed to use 2 clipping lines, chosen depending on the <b>r_dir</b> / <b>g_dir</b> / <b>b_dir</b> (computed during the max saturation step).</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVpU6-pFmo7h7RhWkviKBX5hRrAYTwaCCA01CcKS9m3ecfwXZnNtp6icRxUgCQgW8SciZipm6q3HIWy1ozV7GH0YraIGbDc4qppbY03oNMFXsu68zvTyeJ_-Z1wRKvx0yRt0JZyGMWPCtY/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="108" data-original-width="272" height="79" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVpU6-pFmo7h7RhWkviKBX5hRrAYTwaCCA01CcKS9m3ecfwXZnNtp6icRxUgCQgW8SciZipm6q3HIWy1ozV7GH0YraIGbDc4qppbY03oNMFXsu68zvTyeJ_-Z1wRKvx0yRt0JZyGMWPCtY/w200-h79/use_all_one_clip_lines.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Originally clipping code using all 3 lines<br /></td></tr></tbody></table>
</td>
<td>
<div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4VYlNRLtGivhpsipZc3Wxuyv0mb6_CH5a4Fewczgz-5CHGihGUZEHXySQsKW0Wd1XCfIlMIgJfuy51LjOMtwZUjJFrWdZllr91W53r2zW9-5x_1ryQk8rDw2NOlTopF6k8fTXuQ9JqEZ2/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="323" data-original-width="372" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4VYlNRLtGivhpsipZc3Wxuyv0mb6_CH5a4Fewczgz-5CHGihGUZEHXySQsKW0Wd1XCfIlMIgJfuy51LjOMtwZUjJFrWdZllr91W53r2zW9-5x_1ryQk8rDw2NOlTopF6k8fTXuQ9JqEZ2/" width="276" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">updated clipping code using 2 lines depending on hue<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
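<p>As an aside before the next point: when the target gamut's white point differs from Oklab's D65 (as with ACEScg below), a chromatic adaptation transform (CAT) should be applied first. Here is a minimal Bradford CAT sketch (Python), assuming the usual published white point values; the exact values and CAT used in my renderer may differ:</p>
<pre>
# Minimal sketch: Bradford chromatic adaptation matrix between two whites.
import numpy as np

BRADFORD = np.array([[ 0.8951,  0.2664, -0.1614],
                     [-0.7502,  1.7135,  0.0367],
                     [ 0.0389, -0.0685,  1.0296]])

def bradford_cat(src_white_xyz, dst_white_xyz):
    """Returns a 3x3 matrix adapting XYZ from the source white to the destination white."""
    lms_src = BRADFORD @ src_white_xyz
    lms_dst = BRADFORD @ dst_white_xyz
    scale = np.diag(lms_dst / lms_src)   # von Kries scaling in the Bradford LMS space
    return np.linalg.inv(BRADFORD) @ scale @ BRADFORD

aces_d60 = np.array([0.95265, 1.0, 1.00883])  # ACES white (~D60), xy ~ (0.32168, 0.33767)
d65      = np.array([0.95047, 1.0, 1.08883])  # D65 white
M_aces_to_d65 = bradford_cat(aces_d60, d65)
</pre>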
<p></p><p></p><p>And during my implementation, I accidentally found that when performing gamut clipping for the ACEScg color space, if I forget to apply this chromatic adaptation for the different white point (Oklab uses D65 while ACEScg uses roughly D60), all 3 upper clipping lines need to be used:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvuTieqAWBfthx2GoQXlamJdYxsLQVclBgkpRAZhUb1hTkWPde22datxoXOZrEfpf5ziMPWnY7etVaUJHhUmYNuz0yzV_Jlgw-blNwL_BYGCd7UtvOrHa2oo6l6noILrEd6TA1-xjDxOnU/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="468" data-original-width="775" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvuTieqAWBfthx2GoQXlamJdYxsLQVclBgkpRAZhUb1hTkWPde22datxoXOZrEfpf5ziMPWnY7etVaUJHhUmYNuz0yzV_Jlgw-blNwL_BYGCd7UtvOrHa2oo6l6noILrEd6TA1-xjDxOnU/w400-h241/CAT_bug.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">All 3 upper clipping lines are used due to the chromatic adaptation bug<br /></td></tr></tbody></table><p></p><p><span style="font-size: large;"><b>Result</b></span></p><p>Now, let's see how gamut clipping looks in rendered images. All 5 gamut clipping methods from <a href="https://bottosson.github.io/posts/gamutclipping/">Björn Ottosson's blog</a> are implemented:</p><ol style="text-align: left;"><li>Keep lightness constant, only compress chroma (Chroma clipped)<br /></li><li>Projection towards a single point, hue independent (L<span style="font-size: xx-small;">0</span>=0.5 projection)<br /></li><li>Projection towards a single point, hue dependent (L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span> projection)</li><li>Adaptive L<span style="font-size: xx-small;">0</span>, hue independent (Adaptive L<span style="font-size: xx-small;">0</span>=0.5)<br /></li><li>Adaptive L<span style="font-size: xx-small;">0</span>, hue dependent (Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>)</li></ol><p>Let's start with a night scene; the clipping effect is most noticeable in the blue curtain, with a slight change in the green curtain:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_iNdZvJu1sYAhA5_ra4OhRRR_yCWrxgInCViWXpYcgDGFV-cqpDtDR9OmzY9qhiJEXvrDE-Vn8qThbhjvPe-JGdbJ3ypx-8k74Qzv8P4bJ2l4_Ws2QuJU03zkI7W0a7v7NuLNRgFx8lqB/s16000/scene_0_no_clip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_iNdZvJu1sYAhA5_ra4OhRRR_yCWrxgInCViWXpYcgDGFV-cqpDtDR9OmzY9qhiJEXvrDE-Vn8qThbhjvPe-JGdbJ3ypx-8k74Qzv8P4bJ2l4_Ws2QuJU03zkI7W0a7v7NuLNRgFx8lqB/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping <br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjZwK8GACAhgXHbM5-0pnHk3XiXJ3xL9uAfkQCVYbtp17vYionMO0JSppk8YtDECXlzyHi6eJL7W1PlaEdHgWPiV5TAfZLjaqDNLZd_5KL9vWTSf-exZWgcHnxst9VOgX14LyHA_lF56f2/s16000/scene_0_keep_lightness.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjZwK8GACAhgXHbM5-0pnHk3XiXJ3xL9uAfkQCVYbtp17vYionMO0JSppk8YtDECXlzyHi6eJL7W1PlaEdHgWPiV5TAfZLjaqDNLZd_5KL9vWTSf-exZWgcHnxst9VOgX14LyHA_lF56f2/w320-h170/scene_0_keep_lightness.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chroma clipped<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXJ6M04__R-JXy5Zun06ywUWqF0OfIhXDWBfCoLoXM8kyisC9zOx_oB2QVBxoKl0m3oq1IjzIQaKQ24ANavHAmPrNBfyskmuueop_47GHTzDMdK947xc-9lTShpG-q_Hgi-Vg6JggvB4EP/s16000/scene_0_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXJ6M04__R-JXy5Zun06ywUWqF0OfIhXDWBfCoLoXM8kyisC9zOx_oB2QVBxoKl0m3oq1IjzIQaKQ24ANavHAmPrNBfyskmuueop_47GHTzDMdK947xc-9lTShpG-q_Hgi-Vg6JggvB4EP/w320-h170/scene_0_outOfGamut.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels<br /></td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBEv1u6tlEU5_jeepOWAcWHvtqzJzoFgzapq8rDMRN2DVr_gQ1sGA3FpoAErVdCtUH-lKnGU_BFNrqOKq7jyj5ROgNwwN20f8XXKcUTG-sGZ9Tz2Fofr3PgHoR19mLm5O_1QnUxJNcyBrG/s16000/scene_0_singlePtHueIndependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBEv1u6tlEU5_jeepOWAcWHvtqzJzoFgzapq8rDMRN2DVr_gQ1sGA3FpoAErVdCtUH-lKnGU_BFNrqOKq7jyj5ROgNwwN20f8XXKcUTG-sGZ9Tz2Fofr3PgHoR19mLm5O_1QnUxJNcyBrG/w320-h170/scene_0_singlePtHueIndependent.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=0.5 projection<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGipCJkY4E0o-tuTnDk6-71Mat7Z-Ja7N5psl-C54ImnfAKpLrvNwUD3gk8dvnCzg3qooLxrZetDKUx76PeXxGKo8u5Ke9C5fvbxXBg24-kMV_VtPDIf4frToZzE_2ltHO4I8aUkOcg0py/s16000/scene_0_adaptiveHueIndependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGipCJkY4E0o-tuTnDk6-71Mat7Z-Ja7N5psl-C54ImnfAKpLrvNwUD3gk8dvnCzg3qooLxrZetDKUx76PeXxGKo8u5Ke9C5fvbxXBg24-kMV_VtPDIf4frToZzE_2ltHO4I8aUkOcg0py/w320-h170/scene_0_adaptiveHueIndependent_5_0.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=5.0<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia611Gm74jiAhxMWhVdb7Ztin2vQwNUN4fQ5gVtFU3ktQFNuO_Lthiep9_atYfed8qvhp9pWJs6w3ui0ODS6oJLEUFzOMNS7lQMe0D7Tjsjli794-WimCjKDRpDPyDumgDGlAdP1gTA3NX/s16000/scene_0_adaptiveHueIndependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia611Gm74jiAhxMWhVdb7Ztin2vQwNUN4fQ5gVtFU3ktQFNuO_Lthiep9_atYfed8qvhp9pWJs6w3ui0ODS6oJLEUFzOMNS7lQMe0D7Tjsjli794-WimCjKDRpDPyDumgDGlAdP1gTA3NX/w320-h170/scene_0_adaptiveHueIndependent_0_05.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=0.05</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHsa4tOdgAmin9r-q9JWtcLRd8oI5PcBG2Tcqf7Ir-_UpP4NsRhsGAjNGRzcT5E0xE5dZjmA-xAqa-zNwZeE2UMfl9AgVp04Couq7owdTAwWclL0GSk7FDm2b42qiz8Mz2hmyw4NnOU-Rg/s16000/scene_0_singlePtHueDependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHsa4tOdgAmin9r-q9JWtcLRd8oI5PcBG2Tcqf7Ir-_UpP4NsRhsGAjNGRzcT5E0xE5dZjmA-xAqa-zNwZeE2UMfl9AgVp04Couq7owdTAwWclL0GSk7FDm2b42qiz8Mz2hmyw4NnOU-Rg/w320-h170/scene_0_singlePtHueDependent.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span> projection<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXqJ5k9wpsncc_465GdQXpP8bzeBuHQZGkWyBEalkSqGfEVHRNpY11UKRGEKoKTtTNBJ1qryyhHg4xdHVYMELtoT9SfvNicPidjn8kvbAevrswtwm7pkS8ljEmXAmRf0pXBXNvyIkjZNVP/s16000/scene_0_adaptiveHueDependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXqJ5k9wpsncc_465GdQXpP8bzeBuHQZGkWyBEalkSqGfEVHRNpY11UKRGEKoKTtTNBJ1qryyhHg4xdHVYMELtoT9SfvNicPidjn8kvbAevrswtwm7pkS8ljEmXAmRf0pXBXNvyIkjZNVP/w320-h170/scene_0_adaptiveHueDependent_5_0.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO5AWwRja6aNPxkUdNWsfyd1gMM0jzF2a800O8OnbE13xmT0ZhE1FFxuWiaDqUpOK4fH6bHg_RmFhNjnHIQdxiVTPvbMu21Q7XLoYdAHYxS9tLy0Jxn0mhqTuJCycnTrkp96yTw21RfGRo/s16000/scene_0_adaptiveHueDependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO5AWwRja6aNPxkUdNWsfyd1gMM0jzF2a800O8OnbE13xmT0ZhE1FFxuWiaDqUpOK4fH6bHg_RmFhNjnHIQdxiVTPvbMu21Q7XLoYdAHYxS9tLy0Jxn0mhqTuJCycnTrkp96yTw21RfGRo/w320-h170/scene_0_adaptiveHueDependent_0_05.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=0.05</td></tr></tbody></table>
</td>
</tr>
</tbody></table><p></p><p>Then the following test scenes all use a light with a saturated color (e.g. a red light with (1, 0, 0) in Rec2020) to generate out-of-gamut colors. With a saturated magenta colored light, gamut clipping can do a pretty good job at showing the details in the out-of-gamut areas (e.g. around the lion's face):</p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfE7OnktH5snDy7jA0M0PvQT4VQj1TjdTFsR3UZByfvgAhx7M6vrm8BQRunpEdPxCX0OKi8965PvNB1_tWoY6GWtu4QUv3k6FX-gq3widrvlON1Z_QISyuZ17u-klACCra6EvUx0tTFgD-/s16000/scene_2_noClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfE7OnktH5snDy7jA0M0PvQT4VQj1TjdTFsR3UZByfvgAhx7M6vrm8BQRunpEdPxCX0OKi8965PvNB1_tWoY6GWtu4QUv3k6FX-gq3widrvlON1Z_QISyuZ17u-klACCra6EvUx0tTFgD-/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping </td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoK5azf3t-L3pZMidinolLN9zlOADkry9J3bOj-FHiI9CG0RRBNTLzZ2CKsBGQPZ4GzQ12LVJkdoLAeRP31pYpZGQDkYNNCoGw2UNYY0oEn8JK9Vn8KlchoG3rE53d3uIVX1JzjsZVTBww/s16000/scene_2_keepLightness.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoK5azf3t-L3pZMidinolLN9zlOADkry9J3bOj-FHiI9CG0RRBNTLzZ2CKsBGQPZ4GzQ12LVJkdoLAeRP31pYpZGQDkYNNCoGw2UNYY0oEn8JK9Vn8KlchoG3rE53d3uIVX1JzjsZVTBww/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chroma clipped</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgme1m1ALoYCXN1rgiEZOOIk8wvvSkSo4I6daNIr3hjOPaQllNdLFaU6nYyW-VzhbGDvtbMBuciigfPOWYkVmvq-jRhuNX3HyIwAz9UQ2-rTvntW-ss7xyvqyUo52DzuUAt5uWltmU1lq5E/s16000/scene_2_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgme1m1ALoYCXN1rgiEZOOIk8wvvSkSo4I6daNIr3hjOPaQllNdLFaU6nYyW-VzhbGDvtbMBuciigfPOWYkVmvq-jRhuNX3HyIwAz9UQ2-rTvntW-ss7xyvqyUo52DzuUAt5uWltmU1lq5E/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRjBHVKMiwG0E_cFqBuUblpJEu7uHjeRzWOETBvjLb65S4U-ENBQwXwLwdsACrcwmbvlcaWW3bMfFlyoF1U1Vmnt8KWfU2BVfsqlNRPS-xaT4G8na3w7GwJ2seto96QI254R5bT9Vbz9LJ/s16000/scene_2_singlePointHueIndependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRjBHVKMiwG0E_cFqBuUblpJEu7uHjeRzWOETBvjLb65S4U-ENBQwXwLwdsACrcwmbvlcaWW3bMfFlyoF1U1Vmnt8KWfU2BVfsqlNRPS-xaT4G8na3w7GwJ2seto96QI254R5bT9Vbz9LJ/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=0.5 projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhjjod0vy9fFpsnbTYo5rLKlhf4PGKWRC26-jKK7VO47FtGPgL-DnscIZ6KiMv0XTbWEVcHia1vc3c2tElAshRz3P1ERj7XtBI490we9wVPAejL72R2N6DW5LnyQKJqzFnn849h0rk0-09/s16000/scene_2_adaptiveHueIndependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhjjod0vy9fFpsnbTYo5rLKlhf4PGKWRC26-jKK7VO47FtGPgL-DnscIZ6KiMv0XTbWEVcHia1vc3c2tElAshRz3P1ERj7XtBI490we9wVPAejL72R2N6DW5LnyQKJqzFnn849h0rk0-09/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMEqUKG1t8oCPFnvGfyhlDgmaRITypNv5p6k_0WwpAbZJVYEQKPWj9h6gGDzvDp6vSigqkzUUYpqHtfAS24M_-5ZId6JVo9_m3aqEINRFeSnN-JJGLuTJLMGJyW0bD_1qAJGzvpOWDK-Zu/s16000/scene_2_adaptiveHueIndependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMEqUKG1t8oCPFnvGfyhlDgmaRITypNv5p6k_0WwpAbZJVYEQKPWj9h6gGDzvDp6vSigqkzUUYpqHtfAS24M_-5ZId6JVo9_m3aqEINRFeSnN-JJGLuTJLMGJyW0bD_1qAJGzvpOWDK-Zu/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=0.05</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQruQmgqzax-813rgrAGlInPhn9PzeNqFqW-2l19d8RI_36oJeBNIu9dzIlNyPSdsPIVnhHxkWBHSY7R4WkCYuNPzJdIO5wjGti7tEUCkzIKqTxCQf2ZqYB0iW_0_LefT99BKZmIzM1e3I/s16000/scene_2_singlePointHueDependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQruQmgqzax-813rgrAGlInPhn9PzeNqFqW-2l19d8RI_36oJeBNIu9dzIlNyPSdsPIVnhHxkWBHSY7R4WkCYuNPzJdIO5wjGti7tEUCkzIKqTxCQf2ZqYB0iW_0_LefT99BKZmIzM1e3I/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span> projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifpDMH9ZiXfpRCplAGYXZ30AAHhr80JJ0LmSekSYM7-O14iQK9W7-nCevwXlAvet_lMESC6faftkQ1_PsIDOtK1VcURmvQ5kVvg4spW4ccQCLJ9wbj8aT4wKKiDIpiQiVI0_XxUAc6b9G_/s16000/scene_2_adaptiveHueDependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifpDMH9ZiXfpRCplAGYXZ30AAHhr80JJ0LmSekSYM7-O14iQK9W7-nCevwXlAvet_lMESC6faftkQ1_PsIDOtK1VcURmvQ5kVvg4spW4ccQCLJ9wbj8aT4wKKiDIpiQiVI0_XxUAc6b9G_/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2YF-v-EPRgJ1v31q1g_ToZ-GkgYMsGdaa0VWdIA8vE5n44AO_z159SS0BTvZk_Uc2CvKv0UILGb-MGtb009MVKUI8ROxduHfBTdfbehvqs7Sm7jz50ONijuJ1_CbHHM5gS_hUl-yXExzz/s16000/scene_2_adaptiveHueDependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2YF-v-EPRgJ1v31q1g_ToZ-GkgYMsGdaa0VWdIA8vE5n44AO_z159SS0BTvZk_Uc2CvKv0UILGb-MGtb009MVKUI8ROxduHfBTdfbehvqs7Sm7jz50ONijuJ1_CbHHM5gS_hUl-yXExzz/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=0.05</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
<p></p><p>Changing to a saturated green light, the different clipping methods change the perceived lighting, especially the projection-towards-a-single-point methods:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx4KAXFJRwd6VV3JFROw45IKukbgvPoc5HrgWT_8rudS8yPl6JTJJnkw8F4RmhlWYr0CB0GLCX0op3HOj8jFrpO7Bzlw-QHO0PrJ39WetzZPBMyX6gIwtGrwcIEE7tLJKj44690PrT5K9-/s16000/scene_4_noClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx4KAXFJRwd6VV3JFROw45IKukbgvPoc5HrgWT_8rudS8yPl6JTJJnkw8F4RmhlWYr0CB0GLCX0op3HOj8jFrpO7Bzlw-QHO0PrJ39WetzZPBMyX6gIwtGrwcIEE7tLJKj44690PrT5K9-/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping </td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy93YNsVmHa15elqCkAxyMCE3753Zoex0kvSuW5VLcbsaTq7QgJvD1gAUH3vW1_B-kvbixh4nB-3Hy6-oZr2WBVM9jZ_DkAzLLnQ-UOKD_BOtGdKDam0PgNlvp0D8xI2HBkxqd9VjaTaNY/s16000/scene_4_keepLightness.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy93YNsVmHa15elqCkAxyMCE3753Zoex0kvSuW5VLcbsaTq7QgJvD1gAUH3vW1_B-kvbixh4nB-3Hy6-oZr2WBVM9jZ_DkAzLLnQ-UOKD_BOtGdKDam0PgNlvp0D8xI2HBkxqd9VjaTaNY/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chroma clipped</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_ub-3UmBbM0IK_fAnLs4hKHH05uwWwqwf-5WAtnXQoIz5sATy5sliSMOSDZxcOuBIEH4jDFGuByM7iseB9tRBu9huv72CvhALB0ZhU8499lK_f195_s6KxnShEQaUVR_bFMo5UY4Jshde/s16000/scene_4_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_ub-3UmBbM0IK_fAnLs4hKHH05uwWwqwf-5WAtnXQoIz5sATy5sliSMOSDZxcOuBIEH4jDFGuByM7iseB9tRBu9huv72CvhALB0ZhU8499lK_f195_s6KxnShEQaUVR_bFMo5UY4Jshde/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3gpU9m3gy1riCub2MzaJIjPsiAEaYMCsA4jvIfKuwhtB6wExEd6PbNMdkK4PhmXXOIjEYwSrgIfghWmuea-tnFSI9lkxbAmZSFlDXII5tpkFOuAf8B0_muG6wtJ41wbqT6BQmKi_XVxqC/s16000/scene_4_singlePointHueIndependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3gpU9m3gy1riCub2MzaJIjPsiAEaYMCsA4jvIfKuwhtB6wExEd6PbNMdkK4PhmXXOIjEYwSrgIfghWmuea-tnFSI9lkxbAmZSFlDXII5tpkFOuAf8B0_muG6wtJ41wbqT6BQmKi_XVxqC/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=0.5 projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1DP9mTR7tVXbn-DcGG3HVXagrgz8saPA8skSf0f__wbb4cX9NjZnESupILq9nDXW35EuSgJS68GMguLi-evaadATQ6xBKpmmAprd_J4YAo485v_KkMYDfW6kozN5gF2eLmUltsOZlvFXn/s16000/scene_4_adaptiveHueIndependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1DP9mTR7tVXbn-DcGG3HVXagrgz8saPA8skSf0f__wbb4cX9NjZnESupILq9nDXW35EuSgJS68GMguLi-evaadATQ6xBKpmmAprd_J4YAo485v_KkMYDfW6kozN5gF2eLmUltsOZlvFXn/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaVDy0yFhGqDmhRP08lE8nziclJUJIxrfRK8Ge8LKJSGAMvnqTobLoLfVHzQXmxpKex4Ke381WJQP-vg8ixSACUe2JdmpkXYnJFmwacohVvhYHUno0IHe1Y6tSFpv_m3mTuTGPFS58B4n8/s16000/scene_4_adaptiveHueIndependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaVDy0yFhGqDmhRP08lE8nziclJUJIxrfRK8Ge8LKJSGAMvnqTobLoLfVHzQXmxpKex4Ke381WJQP-vg8ixSACUe2JdmpkXYnJFmwacohVvhYHUno0IHe1Y6tSFpv_m3mTuTGPFS58B4n8/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=0.05</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuM1Zc_t3QhZvkc5gpTDI9f5SGAo9kFkEy_h_B1tbfWexwkGnakrey04owVzavO-Vg20KCMTs-WnEH6s3LrmMj3-3_G4VHEu7_n-azqt4sJMdq2Ud6zZ6wDlKjZ4lrRaeSLvPNbRXOsum5/s16000/scene_4_singlePointHueDependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuM1Zc_t3QhZvkc5gpTDI9f5SGAo9kFkEy_h_B1tbfWexwkGnakrey04owVzavO-Vg20KCMTs-WnEH6s3LrmMj3-3_G4VHEu7_n-azqt4sJMdq2Ud6zZ6wDlKjZ4lrRaeSLvPNbRXOsum5/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span> projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHzKCIsaErzzFZp0r3SSMIDmGzRH4e3fjrrVloWZl3gz9ONjSdvCfo7HQE0nLUXRR4MlC7oVZbmjgvO54BPCAGmyuMQ829X1FeKswpRjJmiQCIpInkL5eRhrr1wcVsVkyKLUzJmGL58Cq-/s16000/scene_4_adaptiveHueDependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHzKCIsaErzzFZp0r3SSMIDmGzRH4e3fjrrVloWZl3gz9ONjSdvCfo7HQE0nLUXRR4MlC7oVZbmjgvO54BPCAGmyuMQ829X1FeKswpRjJmiQCIpInkL5eRhrr1wcVsVkyKLUzJmGL58Cq-/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49e5LoDDjbuv_gP9BYaVCxr0zMT-Wt4lOfFLwpR99mXJJt_o0qkd1-Cyo-wAJ8kkXaO8VQDoVv8rzhIE-dSob5M8iNaIwQYUJumCs549S_beoyrHybqIuChiOLE9KES3ZsKhMcujb4O3x/s16000/scene_4_adaptiveHueDependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49e5LoDDjbuv_gP9BYaVCxr0zMT-Wt4lOfFLwpR99mXJJt_o0qkd1-Cyo-wAJ8kkXaO8VQDoVv8rzhIE-dSob5M8iNaIwQYUJumCs549S_beoyrHybqIuChiOLE9KES3ZsKhMcujb4O3x/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=0.05</td></tr></tbody></table>
</td>
</tr>
</tbody></table><p>With a saturated red light, gamut clipping can greatly reduce the orange/yellow hue shift. This reminds me of the presentations <a href="https://research.activision.com/publications/archives/hdr-in-call-of-duty">"HDR in Call of Duty"</a> and <a href="https://www.ea.com/frostbite/news/high-dynamic-range-color-grading-and-display-in-frostbite">"HDR color grading and display in Frostbite"</a>, which mentioned that some VFX (e.g. fire/explosion) may rely on such a hue shift. I don't know whether that is good or not, but gamut clipping may at least give a closer look between an sRGB display and an HDR display... <br /></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil6hUtjYvpNC8HWfJdNJADq1-F4P-FL3sTCKzi9zOw9Y8XXWPHhqhzgiwW3iXi97cXvIkCnbCSa9hId-fven2StXQelXjqroC2nwZ4kEZBdYziOl37Z_pGE15EkmirsOnzzwUBGl-QdIVg/s16000/scene_1_noClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil6hUtjYvpNC8HWfJdNJADq1-F4P-FL3sTCKzi9zOw9Y8XXWPHhqhzgiwW3iXi97cXvIkCnbCSa9hId-fven2StXQelXjqroC2nwZ4kEZBdYziOl37Z_pGE15EkmirsOnzzwUBGl-QdIVg/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping </td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpubgmhB8U24E8-2tTIurdG-IYBxfdYhILrAt4zasgmKiNdhcyjzxS2PmFpVBwjzykBrI04Qz2SdT3D9PMcs5XM1nMtWtvuHPi6vB0HI6FHnDf-ln-AjGPNWCaiongEegM4dZS3MAwLgqh/s16000/scene_1_keepLightness.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpubgmhB8U24E8-2tTIurdG-IYBxfdYhILrAt4zasgmKiNdhcyjzxS2PmFpVBwjzykBrI04Qz2SdT3D9PMcs5XM1nMtWtvuHPi6vB0HI6FHnDf-ln-AjGPNWCaiongEegM4dZS3MAwLgqh/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Chroma clipped</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjQOQh9NAfZaHcOndX273KHSx4G6WTbf8lRwlHezdFPhk5LKfJ49Vd3lxoCc4OOKvi6O_5I_4RUdTLlFsscXAV6ATIQQ_JNCHVpVXTXjonMySk_8IS1m7G_UwZsCQxd9hVqOTqYVU2pS2A/s16000/scene_1_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjQOQh9NAfZaHcOndX273KHSx4G6WTbf8lRwlHezdFPhk5LKfJ49Vd3lxoCc4OOKvi6O_5I_4RUdTLlFsscXAV6ATIQQ_JNCHVpVXTXjonMySk_8IS1m7G_UwZsCQxd9hVqOTqYVU2pS2A/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggPRHG8hv1XlHv_fN2hG_q8P-fNHYAB4hBsPj12f2Pwk1f1O5rggrh0ugomI0jore_dQvYOqhgrZg39NC8amrO0h8SDkFgPBQr7_7LPHZ4L_EmS4_Q-HesjFpi1lsHH8d_9G0Sm3Av8Xjq/s16000/scene_1_singlePointHueIndependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggPRHG8hv1XlHv_fN2hG_q8P-fNHYAB4hBsPj12f2Pwk1f1O5rggrh0ugomI0jore_dQvYOqhgrZg39NC8amrO0h8SDkFgPBQr7_7LPHZ4L_EmS4_Q-HesjFpi1lsHH8d_9G0Sm3Av8Xjq/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=0.5 projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhE88u5VeZa3Dq7nqcWHt7nme9k__igIUKIxIjAC9IbM7oh792F841cr6BFmnyTiS8FN_96i0h_RifxPupb5pI6SaEjvbfuMA6TdefYiRb_-_YGomUhgFGiippoZnR83q8JOMMQMYEMqsHz/s16000/scene_1_adaptiveIndependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhE88u5VeZa3Dq7nqcWHt7nme9k__igIUKIxIjAC9IbM7oh792F841cr6BFmnyTiS8FN_96i0h_RifxPupb5pI6SaEjvbfuMA6TdefYiRb_-_YGomUhgFGiippoZnR83q8JOMMQMYEMqsHz/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQI6w9FZs-I0sz4jBsVtSoY7VVKI6dgAwFZPmUyhCcFP4G8BL4ko2V1JXQ-kVU46d4hh9PdBOOgJ-0o8xboWmcjulmB6x5MUpF4AjFnwO0I-zpTCvWEecXvSGCo-_lRrzd-wMo1sT22wxi/s16000/scene_1_adaptiveIndependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQI6w9FZs-I0sz4jBsVtSoY7VVKI6dgAwFZPmUyhCcFP4G8BL4ko2V1JXQ-kVU46d4hh9PdBOOgJ-0o8xboWmcjulmB6x5MUpF4AjFnwO0I-zpTCvWEecXvSGCo-_lRrzd-wMo1sT22wxi/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=0.5, α=0.05</td></tr></tbody></table>
</td>
</tr>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOPWaQGnMJYt5u3VqJCDIDnJpGqchaWg7bQh3GuJHzu-ylcPsAOarCftmT8zGFxvPnvcE8nr1ylAqO1uZG2Edocjpq33c4g5vdW90BVdwvTcjz9xcn-LUpDvpVzPB-XmqCzVmgdIDzKU6M/s16000/scene_1_singlePointHueDependent.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOPWaQGnMJYt5u3VqJCDIDnJpGqchaWg7bQh3GuJHzu-ylcPsAOarCftmT8zGFxvPnvcE8nr1ylAqO1uZG2Edocjpq33c4g5vdW90BVdwvTcjz9xcn-LUpDvpVzPB-XmqCzVmgdIDzKU6M/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span> projection</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRxw0Vcq5EhdDPbDkRIb-RL4-n5DDtNXLVcB17-GJn_mmNs5-yRFSppaoT67qNGKGbcO4l_zNUCc2WE8TtT9_-xwVxuYd-ptBL3HQOBAN3TpAB1_yryPEerf0-l8RxIqI34n4PyPrEJ9UA/s16000/scene_1_adaptiveDependent_5_0.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRxw0Vcq5EhdDPbDkRIb-RL4-n5DDtNXLVcB17-GJn_mmNs5-yRFSppaoT67qNGKGbcO4l_zNUCc2WE8TtT9_-xwVxuYd-ptBL3HQOBAN3TpAB1_yryPEerf0-l8RxIqI34n4PyPrEJ9UA/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=5.0</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiILANuCJaZhXxq-koISB9frfqaewJBKnrM7DmmJ3hE54w0tDHT_oF8iaBROx5eVbkJJ8V2X9VxHIpx7A82zc1ZkKrFS1mwLyh7d-HsCDrLvJoU-gJZni2mNcffLLMNhZEAX-Ld26EpPJ19/s16000/scene_1_adaptiveDependent_0_05.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiILANuCJaZhXxq-koISB9frfqaewJBKnrM7DmmJ3hE54w0tDHT_oF8iaBROx5eVbkJJ8V2X9VxHIpx7A82zc1ZkKrFS1mwLyh7d-HsCDrLvJoU-gJZni2mNcffLLMNhZEAX-Ld26EpPJ19/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Adaptive L<span style="font-size: xx-small;">0</span>=L<span style="font-size: xx-small;">cusp</span>, α=0.05</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
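<p>For reference, below is a minimal Python sketch of how the projection target lightness L<span style="font-size: xx-small;">0</span> used in the captions above can be chosen. The adaptive formula follows <a href="https://bottosson.github.io/posts/gamutclipping/">[1]</a> as I understand it; <b>L_cusp</b> (the cusp lightness of the current hue slice) is assumed to be computed elsewhere, so treat this as a sketch rather than the exact demo code:</p>
<pre>
import math

def projection_target_L0(L, C, mode, alpha=0.05, L_cusp=None):
    # L, C: Oklab lightness/chroma of the out of gamut color
    # returns the lightness of the point (L0, chroma=0) we project towards
    if mode == "chroma_only":   # compress chroma only, keep lightness
        return min(max(L, 0.0), 1.0)
    if mode == "point":         # project towards a fixed point, e.g. L0=0.5
        return 0.5
    if mode == "cusp":          # project towards the cusp lightness
        return L_cusp
    if mode == "adaptive":      # adaptive, hue independent (L0 around 0.5)
        Ld = L - 0.5
        sgn = -1.0 if Ld < 0.0 else 1.0
        e1 = 0.5 + abs(Ld) + alpha * C
        return 0.5 * (1.0 + sgn * (e1 - math.sqrt(e1 * e1 - 2.0 * abs(Ld))))
    raise ValueError("unknown mode: " + mode)
</pre>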
<p></p><p><span>Since gamut clipping can reduce the hue shift of saturated red colors, I was wondering whether it can also fix the hue shift of a blue colored light (in sRGB) showing up as purple, which was described in the <a href="https://simonstechblog.blogspot.com/2020/03/dxr-path-tracer.html">DXR Path Tracer post</a> before. Unfortunately, gamut clipping can't fix this... I guess this may need to be fixed earlier in the pipeline (e.g. in the tone mapper, or with another gamut mapping method)...</span></p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9c7CP5h_hbwVAqBkoHeEgfUrlmG_57x6wpdJ-tGldQHnU1LCIUXkVWu4NVU_eUiFBe3Foeb1KK1d-6o9MCHPPMLTEhY3v4X83cqZj30lvb_sSVfYkVf9JkLJXtMztNXde6dDuQoxk_s2Q/s16000/sRGB_blue_noClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9c7CP5h_hbwVAqBkoHeEgfUrlmG_57x6wpdJ-tGldQHnU1LCIUXkVWu4NVU_eUiFBe3Foeb1KK1d-6o9MCHPPMLTEhY3v4X83cqZj30lvb_sSVfYkVf9JkLJXtMztNXde6dDuQoxk_s2Q/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping </td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4P18g_URVdD6EgflFLuV83qzaMR4yYzgXIjUYFzq6GXPIX-lFr0TynnRqocOiOgbo0-sCX0YSGnjouALnUYZbl5EmoEfY3CFx7BLlgjpVY7q_In2_7j56JkovYQaQbYi2ZJ9qRIo6HGMe/s16000/sRGB_blue_gamutClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4P18g_URVdD6EgflFLuV83qzaMR4yYzgXIjUYFzq6GXPIX-lFr0TynnRqocOiOgbo0-sCX0YSGnjouALnUYZbl5EmoEfY3CFx7BLlgjpVY7q_In2_7j56JkovYQaQbYi2ZJ9qRIo6HGMe/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">With gamut clipping </td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit8aaa-_K5_ahoRGvoVW6PHdsnHZOqNIOMlDWEeIJk59g9OvvqdglNwjqEB0iz0EWApnQ3CjbnOR6-y41pNfExtAYJMU8P7t02We2rP30RLZu5qApju0YmqsOgsdYhR-h3nv3kV8tzV3Ix/s16000/sRGB_blue_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit8aaa-_K5_ahoRGvoVW6PHdsnHZOqNIOMlDWEeIJk59g9OvvqdglNwjqEB0iz0EWApnQ3CjbnOR6-y41pNfExtAYJMU8P7t02We2rP30RLZu5qApju0YmqsOgsdYhR-h3nv3kV8tzV3Ix/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
</tr>
</tbody></table>
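<p>As a side note, the "out of gamut pixels" debug view above can be produced by simply flagging any linear sRGB pixel that falls outside the displayable range before clipping. This is only my guess at the criterion (the demo may use a different threshold):</p>
<pre>
def is_out_of_gamut(r, g, b, eps=1e-4):
    # a linear sRGB color is displayable only if all channels lie in [0, 1];
    # eps avoids flagging pixels that are off only by numerical noise
    return any(c < -eps or c > 1.0 + eps for c in (r, g, b))
</pre>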
<p><span>Lastly, a scene without much saturated color, but with overexposure, is tested. Gamut clipping doesn't change the image much:</span></p><table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRbe38sAAHF7U7lhp1BQqsth6WK-xRqMqm-WOf2Tub115lZLPWetl2cp2xj1fi2G8bm-en19MYDCU0nRHnVL7K5ykUbDNzUxE4Qt0Dq_Javy4AJyZXGV8_1u8YApaYpnr-iijpZgsBI8CQ/s16000/ov_noClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRbe38sAAHF7U7lhp1BQqsth6WK-xRqMqm-WOf2Tub115lZLPWetl2cp2xj1fi2G8bm-en19MYDCU0nRHnVL7K5ykUbDNzUxE4Qt0Dq_Javy4AJyZXGV8_1u8YApaYpnr-iijpZgsBI8CQ/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Without gamut clipping </td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0r3SCuqF5SiwmXmRMe155A_3mOc0cw6QKmn3J3s_7k4LKr7ezWclwU5iLUT-Kd0vUaxnoHklr3VWLoP7TEXCXHCzUlwgax3EhvZdUDckizW-I7hJXyQ0diyRmEpr2P8vV1b2Z7CzhzDGh/s16000/ov_gamutClip.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0r3SCuqF5SiwmXmRMe155A_3mOc0cw6QKmn3J3s_7k4LKr7ezWclwU5iLUT-Kd0vUaxnoHklr3VWLoP7TEXCXHCzUlwgax3EhvZdUDckizW-I7hJXyQ0diyRmEpr2P8vV1b2Z7CzhzDGh/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">With gamut clipping </td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJnuGfpvZNo3Kmkq1oJGZ_U0YkaNuQV5-AKwypbhPpeCKFgJ-m7ClG_6PmKeXbj4FcjUqbE15AySZOqd6Pv7U92gnJLd8uw2bAxFUwpvHIcb0WgWuTyvTDGklvIOPUL49MnL0W-_d0FSm8/s16000/ov_outOfGamut.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJnuGfpvZNo3Kmkq1oJGZ_U0YkaNuQV5-AKwypbhPpeCKFgJ-m7ClG_6PmKeXbj4FcjUqbE15AySZOqd6Pv7U92gnJLd8uw2bAxFUwpvHIcb0WgWuTyvTDGklvIOPUL49MnL0W-_d0FSm8/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Out of gamut pixels</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span></span></div>
</td>
</tr>
</tbody></table>
<br /><p><span style="font-size: large;"><b>Conclusion</b></span></p><p><span>In this post, an analytical solution is provided to perform gamut clipping for gamuts other than sRGB, and different gamut clipping methods are tested. "Compress chroma only" looks quite decent, while projection towards a single point may change the perceived lightness of the image (depending on the lighting setup). The adaptive method with a small alpha value (e.g. 0.05) behaves similarly to the compress-chroma-only method, while with a large alpha (e.g. >5.0) it behaves similarly to the projection-towards-single-point method. The demo can be <a href="https://drive.google.com/file/d/1R3mGRkG8T8reNRthwXr1XTh1JzMvjFtD/view?usp=sharing">downloaded</a> to play around with the different gamut clipping methods. Note that the demo relies on a saturated light color to generate out of gamut colors, and all the albedo textures are in sRGB (because the texture spectral up-sampling method only supports sRGB, while the light color uses a different spectral up-sampling method). Also, my demo performs the gamut clipping before blending with the UI, as all the UI is in sRGB color space; in the future, I may need to think about whether the UI should be gamut clipped if wide colors are used...<br /></span></p><p><span style="font-size: medium;"><b>References</b></span></p><p><span style="font-size: x-small;">[1] <a href="https://bottosson.github.io/posts/gamutclipping/">https://bottosson.github.io/posts/gamutclipping/</a></span></p><p><span style="font-size: x-small;">[2] <a href="https://www.ea.com/frostbite/news/high-dynamic-range-color-grading-and-display-in-frostbite">https://www.ea.com/frostbite/news/high-dynamic-range-color-grading-and-display-in-frostbite</a></span></p><p><span style="font-size: x-small;">[3] <a href="https://research.activision.com/publications/archives/hdr-in-call-of-duty">https://research.activision.com/publications/archives/hdr-in-call-of-duty</a><br /></span></p>Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-63535557944682379232021-05-23T18:53:00.002+08:002021-05-26T01:14:52.528+08:00Studying Gamut Clipping<p><span style="font-size: large;"><b>Introduction</b></span></p><p>Recently, I was studying a technique called gamut clipping from <a href="https://bottosson.github.io/posts/gamutclipping/">this blog post</a>. This technique handles out of gamut colors and brings them back into a valid range, which helps reduce hue shift and color clipping. That blog post explains the concept clearly, but I was struggling to understand the sample code the author provided. So this blog post will describe what I have learnt while studying the gamut clipping source code.
Also, I have written a <a href="https://www.shadertoy.com/view/fsBXRV">Shadertoy</a> sample to help me understand the problem.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhADMu-4xenLySau8_wrBgJPCY52BFRo-_TLqlY9TFy_XBVmduuA0XcJSwANiBVlIKPzqBVwUUgfwRLggbGhFO7VWrT7Ocotxx_fykCfluDwhndXWY_lst6XtKWisFibAu7zEEroljo1mV0/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="338" data-original-width="600" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhADMu-4xenLySau8_wrBgJPCY52BFRo-_TLqlY9TFy_XBVmduuA0XcJSwANiBVlIKPzqBVwUUgfwRLggbGhFO7VWrT7Ocotxx_fykCfluDwhndXWY_lst6XtKWisFibAu7zEEroljo1mV0/w400-h225/main.gif" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Drawing the valid sRGB colors with Oklab chroma as the horizontal axis and<br />Oklab lightness as the vertical axis, with the hue value displayed at the upper right corner<br /></td></tr></tbody></table><p></p><p><span style="font-size: large;"><b>Overview of the gamut clipping<br /></b></span></p><p>This section briefly describes the steps to perform gamut clipping; feel free to skip it if you have already read the original <a href="https://bottosson.github.io/posts/gamutclipping/">gamut clipping blog post</a>. The technique first converts the out of gamut color (e.g. those pixels with values >1.0 or <0.0 in sRGB) into the <a href="https://bottosson.github.io/posts/oklab/">Oklab color space</a> (which can be expressed with lightness, hue and chroma). Then we project this out of gamut color along a straight line to the "triangular" gamut boundary (like the pictures below). To calculate this intersection, we need to find the cusp coordinates of this particular hue slice. The author uses curve fitting to approximate the cusp coordinates with a polynomial equation. With the cusp coordinates calculated, the intersection point can be found by numerical approximation using <a href="https://en.wikipedia.org/wiki/Halley%27s_method">Halley's method</a>. The whole operation is summarized in the code sketch after the figures below. Finally, the intersection point can be converted back to sRGB color space.</p><p></p>
<table>
<tbody><tr>
<td>
<div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3mTIlqplE8RCFGVmwk3JGKS1h0gaZGXMiA10vgXimFEMu1tGzj9FuXKWZqkRqpQAF79PL1VoQgKHCmasmA1P4QXt_9mbTPm43MDrkoGHkSjwyNBAvn0jPJM27apaQPxRE_IblreBO0e6i/"><img alt="" data-original-height="424" data-original-width="753" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3mTIlqplE8RCFGVmwk3JGKS1h0gaZGXMiA10vgXimFEMu1tGzj9FuXKWZqkRqpQAF79PL1VoQgKHCmasmA1P4QXt_9mbTPm43MDrkoGHkSjwyNBAvn0jPJM27apaQPxRE_IblreBO0e6i/" width="320" /></a></div>
</td>
<td>
<div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl-MEIyKpJThqaktL8KWiuJaOdF1af57w3d97oY-9UhsxJmSxjP4tNf1zD5Hub9wMt-BNu77_2Lkm1-nIBPMNeOnap-onMlGVnbrjirleAaHSu9n3j62mLjVngCgcnaIydmI1UCfWJ6Lc3/"><img alt="" data-original-height="424" data-original-width="753" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl-MEIyKpJThqaktL8KWiuJaOdF1af57w3d97oY-9UhsxJmSxjP4tNf1zD5Hub9wMt-BNu77_2Lkm1-nIBPMNeOnap-onMlGVnbrjirleAaHSu9n3j62mLjVngCgcnaIydmI1UCfWJ6Lc3/" width="320" /></a></div>
</td>
</tr>
<tr>
<td><span style="font-size: x-small;">Showing an out of gamut color point being projected back to a valid value along a straight line.</span></td>
<td><span style="font-size: x-small;">Displaying all colors including the sRGB-clipped ones; out of gamut colors result in a hue shift (e.g. a blue hue displayed as purple).</span></td>
</tr>
</tbody></table>
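<p>To summarize the steps above in code form, here is a rough Python outline of the whole clipping operation. All helper functions are placeholders for the conversions and approximations described in the rest of this post (this is my sketch, not the author's actual implementation):</p>
<pre>
import math

def gamut_clip(rgb):
    # rgb: linear sRGB color, possibly with channels outside [0, 1]
    if all(0.0 <= c <= 1.0 for c in rgb):
        return rgb                              # already displayable
    L, C, h = linear_srgb_to_oklch(rgb)         # to Oklab lightness/chroma/hue
    a, b = math.cos(h), math.sin(h)             # hue as a point on the unit circle
    cusp = find_cusp(a, b)                      # polynomial cusp approximation
    L0 = choose_projection_target(L, C, cusp)   # where to project towards
    # parameter t in [0, 1] of the projection line where it hits the boundary
    # (simple line-line intersection, refined by Halley's method when needed)
    t = find_gamut_intersection(a, b, L, C, L0, cusp)
    L_clip = L0 * (1.0 - t) + t * L
    C_clip = t * C
    return oklch_to_linear_srgb(L_clip, C_clip, h)
</pre>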
<p></p><p><span style="font-size: large;"><b>Finding maximum saturation</b></span></p><p>To approximate the cusp, instead of taking the hue value and returning the cusp <i><b>(chroma, lightness)</b></i> coordinates directly, the author uses the maximum saturation value to find the cusp coordinates. This is the first thing I didn't understand when reading the source code: why is saturation related to the cusp? To understand this, we can start from the definition of saturation. Both chroma and saturation describe colorfulness, but saturation is "somehow normalized" and "not affected" by the brightness. A chroma/lightness/saturation relationship can be defined <a href="https://en.wikipedia.org/wiki/Colorfulness#CIELUV_and_CIELAB">like this</a>:<br /></p><p></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPEgKsdu4aXz2oE3HqVtsTDIhpRIsMLgxCAP_a0aGETnKfI3xjIWsHmBYeWEeWQw0ZuFlLLN-NYD_lBAFfcXLdX3Zz4bnYzqnYLWsQwPXmPa6WZCfTKTlrMpELHwkxKiuSpiyP_cleKPuQ/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="172" data-original-width="533" height="103" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPEgKsdu4aXz2oE3HqVtsTDIhpRIsMLgxCAP_a0aGETnKfI3xjIWsHmBYeWEeWQw0ZuFlLLN-NYD_lBAFfcXLdX3Zz4bnYzqnYLWsQwPXmPa6WZCfTKTlrMpELHwkxKiuSpiyP_cleKPuQ/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Saturation definition for the CIELAB space from Wikipedia;<br />we use the same definition for the Oklab space too.<br /></td></tr></tbody></table>Reading the author's <a href="https://colab.research.google.com/drive/1JdXHhEyjjEE--19ZPH1bZV_LiGQBndzs?usp=sharing">Colab source code</a>, he approximates the max saturation with a polynomial function using <i><b>a=cos(hue)</b></i> and <i><b>b=sin(hue)</b></i>:<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDu1l3fW2Ua-NdYe-Dn1r5h-Wxpk1L00mf_9qx_evzN_xBjeMCcPOaEG1_kmh1Kvx7ODDRhyPkmqYAQ4xWHU5bGxJPy7Y_fvvIEzf-KZ4r4BaJQ22Msv5MOidfbBrdLiICZMFk0vIzOs18/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="82" data-original-width="1105" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDu1l3fW2Ua-NdYe-Dn1r5h-Wxpk1L00mf_9qx_evzN_xBjeMCcPOaEG1_kmh1Kvx7ODDRhyPkmqYAQ4xWHU5bGxJPy7Y_fvvIEzf-KZ4r4BaJQ22Msv5MOidfbBrdLiICZMFk0vIzOs18/w400-h30/optimize_func.png" width="400" /></a></div><p></p><p>Then these <i><b>(saturation, hue)</b></i> values are converted to linear sRGB space and minimized for <b>R=0</b>, <b>G=0</b> and <b>B=0</b>, which yields 3 polynomial equations to approximate the maximum saturation.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrVbjQNK0thtV7sFK_cxVQOBbi6kLEkcGzrG72fxpn0R12X3zS06kwBQSgfD901fkCCPckEfqOYjtB2jmeRn9fk4yoJs2warBTfm5OZD5YOPcO12QeTsTq7lFF-2l4IzYrCiqhVmxetj_I/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="361" data-original-width="1142" height="202" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrVbjQNK0thtV7sFK_cxVQOBbi6kLEkcGzrG72fxpn0R12X3zS06kwBQSgfD901fkCCPckEfqOYjtB2jmeRn9fk4yoJs2warBTfm5OZD5YOPcO12QeTsTq7lFF-2l4IzYrCiqhVmxetj_I/w640-h202/colab_optimize.png" width="640" /></a></div><p></p><p>But why can we obtain the maximum saturation when either <b>R</b>/<b>G</b>/<b>B</b> = 0? To get a rough understanding of it, I decided to open a color picker with an HSL slider to "simulate" how the RGB values change:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5Rav5Z336ml63nROTCXnhU9mjo4zYJcolJOeCs58LnTqvYSc0-hGAUBJN9PbQDZxjW_DDG31_Xvmec8xcpBjc4fo7u0LlRcXNTLnqVhwTsz14REIH0GHsODfs9ernDMKWR-n6cwAn_vM/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="528" data-original-width="800" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5Rav5Z336ml63nROTCXnhU9mjo4zYJcolJOeCs58LnTqvYSc0-hGAUBJN9PbQDZxjW_DDG31_Xvmec8xcpBjc4fo7u0LlRcXNTLnqVhwTsz14REIH0GHsODfs9ernDMKWR-n6cwAn_vM/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A color picker at maximum saturation with the hue changing slowly.<br />There will always be a 0 in the RGB values.<br /></td></tr></tbody></table><p></p><p>First I chose the most saturated red color (255, 0, 0) in sRGB, which yields the HSL value (0, 240, 120). Then I changed the hue value slowly to observe how the RGB values change. From the above animated gif, we can see that there is always a 0 in either the R, G or B channel. So I guess the author is using this property, and the Oklab space has a similar property.</p><p>To further understand how the saturation looks for all hue slices in Oklab, I plotted the saturation value; it seems to always have its largest value at the lower "triangle" edge:<br /></p><p></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3MgiLfybaGVS67Z50WgBEZiXu_hgQzKrMLx0Py72YfuaX7XoFd7mC10NpBWBkHGTkHhA7emA26PbjgC8IxiLnlh_LnOTVjVY9luODrxfv0oYpUjFf5OYcd9lcbhRlukRDymOOEX6cZjcb/"><img alt="" data-original-height="338" data-original-width="600" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3MgiLfybaGVS67Z50WgBEZiXu_hgQzKrMLx0Py72YfuaX7XoFd7mC10NpBWBkHGTkHhA7emA26PbjgC8IxiLnlh_LnOTVjVY9luODrxfv0oYpUjFf5OYcd9lcbhRlukRDymOOEX6cZjcb/" width="320" /></a></div><p></p><p>We can also draw the line for those pixels with <b>R</b>/<b>G</b>/<b>B</b> value = 0:</p><p></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTpC39IWVvP4AR8Fp6A8QGxVN7yRQ7q1i8mL9TXcI9slMQWOSBpxuw3IOeUKaKkeg3Unr-HLEYeOlyI_75XGG_2chhsIUTKKk9RhxFukEWReCgjBlg8MrhXKjctJhk7nlan3UeMBNN32ed/"><img alt="" data-original-height="338" data-original-width="600" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTpC39IWVvP4AR8Fp6A8QGxVN7yRQ7q1i8mL9TXcI9slMQWOSBpxuw3IOeUKaKkeg3Unr-HLEYeOlyI_75XGG_2chhsIUTKKk9RhxFukEWReCgjBlg8MrhXKjctJhk7nlan3UeMBNN32ed/" width="320" /></a></div><p></p><p>The lower "triangle" edge is actually the "clipping" line when <b>R=0</b> or <b>G=0</b> or <b>B=0</b> (switching between these 3 lines)! That's why the author uses 3 different polynomials to approximate the max saturation. The next problem is how to pick one of the three approximated polynomials. From the above animated gif, we know that the "clipping line" changes when the sRGB value = (1, 0, 0), (0, 1, 0) or (0, 0, 1).
We can see this from the <a href="https://colab.research.google.com/drive/1JdXHhEyjjEE--19ZPH1bZV_LiGQBndzs?usp=sharing">Colab sample code</a>, which finds <b>r_dir</b>/<b>g_dir</b>/<b>b_dir</b> and uses them to pick one of the three approximated polynomials.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7F48oXo_I2zoNrygVmXSAPuCNGgpjQeQisfRKuNOXe41OGrsxNdDtcC-SA7eZBWryKfwKpdCQn1GS4-v3JmUpom37bfk6UIaUvnz6ExxuaujejQtNC0cKf2AFapldFrosMO_NyhhZ6x9Y/s1391/rgb_dir.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="347" data-original-width="1391" height="160" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7F48oXo_I2zoNrygVmXSAPuCNGgpjQeQisfRKuNOXe41OGrsxNdDtcC-SA7eZBWryKfwKpdCQn1GS4-v3JmUpom37bfk6UIaUvnz6ExxuaujejQtNC0cKf2AFapldFrosMO_NyhhZ6x9Y/w640-h160/rgb_dir.png" width="640" /></a></div>The <b>r_dir</b>/<b>g_dir</b>/<b>b_dir</b> are the "half vectors" between the "opposite hues", like in the figure below. These direction vectors can be used with a dot product to check which <b>r_h</b>/<b>g_h</b>/<b>b_h</b> range a hue value (in the form <b>cos(hue)</b>, <b>sin(hue)</b>) lies within.<p></p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUXMaLSC-1hi2ar8XZWH4WXDENkb6jNhvpJjhrEubTJilR8yP1phyphenhyphenjDDKcKAiNBmG2B3ox-oa4uh2_swBlLQOgPqxjQYyh9F4auknWZ-i6i1xAgCPdOp8JKEeHlSRtV0XJHqsfvvzrg0LB/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="934" data-original-width="1024" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUXMaLSC-1hi2ar8XZWH4WXDENkb6jNhvpJjhrEubTJilR8yP1phyphenhyphenjDDKcKAiNBmG2B3ox-oa4uh2_swBlLQOgPqxjQYyh9F4auknWZ-i6i1xAgCPdOp8JKEeHlSRtV0XJHqsfvvzrg0LB/w320-h292/polar.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Visualizing the r/g/b_dir<b>'</b> vector (before multiplying the scalar constant for fast dot product check).</span></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9QTijxAWsU2te8mSX9f-JqZz2-JSPxoWRO9PK-5U8BXza1WevAOIC9ZmEv6lz4fzDWtX34X7fOkUXPQlCMu3tH_f6hBZxGFZFTPVWeODLL2nbXPzMGKO2Zgs-SQrV-NrhWzPu6BavbLd6/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="525" data-original-width="1007" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9QTijxAWsU2te8mSX9f-JqZz2-JSPxoWRO9PK-5U8BXza1WevAOIC9ZmEv6lz4fzDWtX34X7fOkUXPQlCMu3tH_f6hBZxGFZFTPVWeODLL2nbXPzMGKO2Zgs-SQrV-NrhWzPu6BavbLd6/w400-h209/r_dir_usage.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">The formula for checking which hue range the <i><b>a</b></i>, <i><b>b</b></i> hue value is in, which can be derived from dot product.</span></td></tr></tbody></table>
</td>
</tr>
</tbody></table>
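<p>Putting the two pieces together, the maximum saturation lookup is just a dot product test followed by evaluating the selected polynomial. In the Python sketch below, the selector constants are the ones I read from the reference implementation in [1] (please verify against the original source), and the fitted <b>k</b> coefficient sets are placeholders:</p>
<pre>
# placeholder coefficient sets; the real fitted values come from the
# Colab optimization / reference [1]
K_RED = K_GREEN = K_BLUE = (0.0, 0.0, 0.0, 0.0, 0.0)

def compute_max_saturation(a, b):
    # a = cos(hue), b = sin(hue); pick the polynomial of the channel that
    # reaches 0 first, using the r/g/b_dir dot product test shown above
    if -1.88170328 * a - 0.80936493 * b > 1.0:
        k = K_RED        # red channel hits 0 first
    elif 1.81444104 * a - 1.19445276 * b > 1.0:
        k = K_GREEN      # green channel hits 0 first
    else:
        k = K_BLUE       # blue channel hits 0 first
    return k[0] + k[1] * a + k[2] * b + k[3] * a * a + k[4] * a * b
</pre>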
<p>Another thing I didn't understand: when the author minimizes the RGB values to 0, he raises the error value to the power of 10 in the <b>e_R(x)</b>/<b>e_G(x)</b>/<b>e_B(x)</b> functions.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO_QMi57Ps06Z36RXSbxl1f1_NxtqOvOOwlYwbmF_lhoLknWK2XwPrJPKfmvhylZ8bZLCazF0nuN9SUaSSu69jgJmscQUd6ViG0eP-sWlWNa0V7S2TT9UGhrLQPIKaVdzZcRPtAIMcFwO7/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="28" data-original-width="506" height="22" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO_QMi57Ps06Z36RXSbxl1f1_NxtqOvOOwlYwbmF_lhoLknWK2XwPrJPKfmvhylZ8bZLCazF0nuN9SUaSSu69jgJmscQUd6ViG0eP-sWlWNa0V7S2TT9UGhrLQPIKaVdzZcRPtAIMcFwO7/w400-h22/pow10.png" width="400" /></a></div><p></p><p>When I tried throwing random values into the initial guess of <b>scipy.optimize.minimize()</b> (e.g. using all 0s or all 1s as the initial guess), the resulting approximated curve was not that good...</p><p>However, when I changed the <b>pow(x, 10)</b> to <b>abs(x)</b>, using random initial guess values gave a more predictable result (though the error is not as small as the author's approximation). Maybe I will use <b>abs(x)</b> instead when optimizing for different color spaces in the future.<br /></p><p><b></b></p><p><span style="font-size: large;"><b>Finding cusp from maximum saturation</b></span></p><p>With the above polynomial approximation, we can find the "lower clipping line" of the gamut "triangle" (i.e. where either the R/G/B value = 0). The cusp must be on this "clipping line"; we only need to identify the lightness value at the cusp. Looking at the author's <b><i>find_cusp()</i></b> function, it seems the cusp will have an R, G or B value equal to 1:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZP0YjZkFzWkQESHNEiDDut91Wl5R5IhJhDe1uDRm-d7TiGIGKANU42ugaUJsY4iIuORlyfSeZ0PwggNRFYoQ6B0xxohsHELAKcQNhqfkw_crgiVbCEb-kKS_e37x5xHaK5DUSHG6i4Ip7/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="423" data-original-width="923" height="294" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZP0YjZkFzWkQESHNEiDDut91Wl5R5IhJhDe1uDRm-d7TiGIGKANU42ugaUJsY4iIuORlyfSeZ0PwggNRFYoQ6B0xxohsHELAKcQNhqfkw_crgiVbCEb-kKS_e37x5xHaK5DUSHG6i4Ip7/w640-h294/find_cusp.png" width="640" /></a></div>To get a better understanding, we can repeat the above color picker "experiment": when changing the hue value at maximum saturation, besides always having a 0 value in <b>R</b>/<b>G</b>/<b>B</b>, there is always a 255 value in <b>R</b>/<b>G</b>/<b>B</b> too!<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5Rav5Z336ml63nROTCXnhU9mjo4zYJcolJOeCs58LnTqvYSc0-hGAUBJN9PbQDZxjW_DDG31_Xvmec8xcpBjc4fo7u0LlRcXNTLnqVhwTsz14REIH0GHsODfs9ernDMKWR-n6cwAn_vM/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="528" data-original-width="800" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5Rav5Z336ml63nROTCXnhU9mjo4zYJcolJOeCs58LnTqvYSc0-hGAUBJN9PbQDZxjW_DDG31_Xvmec8xcpBjc4fo7u0LlRcXNTLnqVhwTsz14REIH0GHsODfs9ernDMKWR-n6cwAn_vM/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">The same color picker gif as above;<br />pay attention to the RGB values:<br />in addition to always having a 0 value,<br />there is a 255 value too.</span><br /></td></tr></tbody></table><p>Plotting the line when the <b>R</b>/<b>G</b>/<b>B</b> value = 1 for all hue slices in Oklab space:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5hIYjWqQ-XqaQL0I-tJDqw-PR1QN8nqLe0CU_H1A-8oHM26hV7Bb6J3Q8tW5oNaryTp4XOmX1xEJUjZJvloM1DNmKbEYVXQHiJgwbHXqLS9XImgBxjSI6dNDVwNkSQc7RqX3sHiGmRfVC/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="338" data-original-width="600" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5hIYjWqQ-XqaQL0I-tJDqw-PR1QN8nqLe0CU_H1A-8oHM26hV7Bb6J3Q8tW5oNaryTp4XOmX1xEJUjZJvloM1DNmKbEYVXQHiJgwbHXqLS9XImgBxjSI6dNDVwNkSQc7RqX3sHiGmRfVC/" width="320" /></a></div><p>When the <b>R/G/B</b> value = 1, it is the upper clipping line of the "gamut triangle"! So the cusp is the intersection between the lower clipping line with <b>R/G/B</b>=0 and the upper clipping line with <b>R/G/B</b>=1.<br /></p><p>So, to obtain the maximum lightness and maximum saturation, we can scale the lightness so that when converted back to sRGB, one of the RGB values equals 1 (taking the cube root is needed because of the <a href="https://bottosson.github.io/posts/oklab/">Oklab color space LMS definition</a>).</p><p></p><p><span style="font-size: large;"><b>Finding gamut intersection</b></span></p><p></p><p>With the gamut cusp coordinates, the projection target coordinates and the out of gamut coordinates, we can find the intersection with the gamut boundary during projection. First we need to check whether the intersection is in the upper half or the lower half of the valid gamut "triangle". This can be determined by the formula below, which can be derived by checking whether the projection line is on the left/right side of the cusp line using a cross product.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwV7VWQI1-aR_6hSxvXhK8B99XQ9VJmEDdZn_ILidahB2hQkxNQ3zhNcPmHmZDhVFhE-EaaNFPQzR3Gk3LjjBsgb40hqUT1eeNROWfeGd-mdkUySVNd_0SecDdwduB96dWYAGBSCu8qUXz/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="89" data-original-width="606" height="59" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwV7VWQI1-aR_6hSxvXhK8B99XQ9VJmEDdZn_ILidahB2hQkxNQ3zhNcPmHmZDhVFhE-EaaNFPQzR3Gk3LjjBsgb40hqUT1eeNROWfeGd-mdkUySVNd_0SecDdwduB96dWYAGBSCu8qUXz/w400-h59/upper_lower_check.png" width="400" /></a></div><p></p><p>If the intersection is in the lower half, it is just a simple line-line intersection. If it is in the upper half, we first approximate the intersection with a line-line intersection, and then refine the answer with Halley's method.
Since we know the upper clipping line is <b>R</b>/<b>G</b>/<b>B</b>=1, we can see the author using this property when applying Halley's method in the following code snippet:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_DIUWD4eTZbL55eNp5yZ5S0xsm-X1RwMqCCh3XZDeO8yrs6-JKEHoxr62iVcVkEOwAaKULvDHXhWZU6aYFmWKhnGiUh2dR0smMAlxEoJYG-teTDVIXoc0eOT-PBnktHMKICg5Y8n2Du3e/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="90" data-original-width="750" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_DIUWD4eTZbL55eNp5yZ5S0xsm-X1RwMqCCh3XZDeO8yrs6-JKEHoxr62iVcVkEOwAaKULvDHXhWZU6aYFmWKhnGiUh2dR0smMAlxEoJYG-teTDVIXoc0eOT-PBnktHMKICg5Y8n2Du3e/w400-h48/halley_minus1.png" width="400" /></a></div><p></p><p>The author uses all 3 clipping curves and takes the minimum value to select the closest curve that clips the gamut.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeZfmaCx5Lpi7snZ3vEAyKcOnDDMg-JcZZ6uDKGbmJb62HDfI92H_bock7aYFQwmssK5SwS8FwJjgcBv5JrAO5ICoUipadDi8LtOonoYgbrlBl35eiY9lzuVfHIR3CiLfTy0bZQyXida9n/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="145" data-original-width="345" height="84" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeZfmaCx5Lpi7snZ3vEAyKcOnDDMg-JcZZ6uDKGbmJb62HDfI92H_bock7aYFQwmssK5SwS8FwJjgcBv5JrAO5ICoUipadDi8LtOonoYgbrlBl35eiY9lzuVfHIR3CiLfTy0bZQyXida9n/w200-h84/upper_min_rgb.png" width="200" /></a></div><p></p><p>If a small error is acceptable, we can use 1 clipping curve instead of 3. The upper clipping curve change happens roughly at RGB values (0, 1, 1), (1, 0, 1) and (1, 1, 0). We can use the same method as picking the <b>R</b>/<b>G</b>/<b>B</b>=0 curve during the maximum saturation calculation, which relies on <b>r_dir</b>, <b>g_dir</b> and <b>b_dir</b>: we can compute <b>c_dir</b>, <b>m_dir</b> and <b>y_dir</b> (which correspond to the cyan, magenta and yellow directions). Those coefficients can be found in my <a href="https://www.shadertoy.com/view/fsBXRV">Shadertoy source code</a>. Because the upper clipping curve is not a straight line, we may need 2 clipping lines to compute the correct answer for some hue values:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBM9wz01OYXT_UdPdHOT9li2Q74X0ZSx1xZbZXWQZWzXlovR504WB_Dr2VxWkozzAY_9aF0-yFM-6xy72aiGisvxhmgK_Fm0i_Ad6AGXdiDyLPAV1tPSGIp56KQn6tslikenXvnlFSSuIz/s759/one_line_2_curve.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="367" data-original-width="759" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBM9wz01OYXT_UdPdHOT9li2Q74X0ZSx1xZbZXWQZWzXlovR504WB_Dr2VxWkozzAY_9aF0-yFM-6xy72aiGisvxhmgK_Fm0i_Ad6AGXdiDyLPAV1tPSGIp56KQn6tslikenXvnlFSSuIz/w400-h194/one_line_2_curve.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">2 upper clipping lines are needed for this hue slice<br /></td></tr></tbody></table><p></p><p><span style="font-size: large;"><b>Conclusion</b></span></p><p>In this post, I have described the process of learning the gamut clipping technique.
With the help of the <a href="https://www.shadertoy.com/view/fsBXRV">Shadertoy sample</a>, we can see that the gamut boundary consists of the lines with <b>R</b>/<b>G</b>/<b>B</b>=0 and <b>R</b>/<b>G</b>/<b>B</b>=1, and the gamut cusp is the intersection between the <b>R</b>/<b>G</b>/<b>B</b>=0 line and the <b>R</b>/<b>G</b>/<b>B</b>=1 line. But I still have some questions left in my mind: Is it meaningful to use negative chroma values? Does gamut clipping work well in practice, or is a gamut compression method needed? Maybe I will implement gamut clipping in my toy path tracer to see how it looks.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR40oF6Z5kNfW4f8YNvDSyZhjiMaf44Me_B80zMUPQW4sA_-g_XLIDLkJV8LOPc7XWl3G28j9hvrXGu0s5kembBGQcEvOP8WpJMaDDxMMYjOXNzc8ZPNBLwbq9QFOBPrAfVO9eYDyYj5DS/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1080" data-original-width="1920" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR40oF6Z5kNfW4f8YNvDSyZhjiMaf44Me_B80zMUPQW4sA_-g_XLIDLkJV8LOPc7XWl3G28j9hvrXGu0s5kembBGQcEvOP8WpJMaDDxMMYjOXNzc8ZPNBLwbq9QFOBPrAfVO9eYDyYj5DS/w400-h225/zoom_out.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Zooming out the graph, do the negative values (e.g. negative chroma, negative lightness) have meaning?<br />The G=1 clipping line "wraps around" to negative lightness; does it have meaning too?<br /></td></tr></tbody></table><p></p><p><b>Reference</b></p><p>[1] <a href="https://bottosson.github.io/posts/gamutclipping/">https://bottosson.github.io/posts/gamutclipping/</a> <br /></p><p>[2] <a href="https://bottosson.github.io/posts/oklab/">https://bottosson.github.io/posts/oklab/</a></p><p>[3] <a href="https://en.wikipedia.org/wiki/Colorfulness#CIELUV_and_CIELAB">https://en.wikipedia.org/wiki/Colorfulness#CIELUV_and_CIELAB</a><br /></p><p><br /></p>Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-82557293242095395242021-03-26T01:01:00.015+08:002021-03-26T01:13:35.056+08:00Importance sampling visible wavelength<p><span style="font-size: large;"><b>Introduction</b></span></p><p>It has been half a year since my last post. Due to the pandemic and political environment in Hong Kong, I haven't had much time/mood to work on my hobby path tracer... Recently, I tried to get back to this hobby; maybe it is better to work on some small tasks first. One thing I am not satisfied with in the <a href="http://simonstechblog.blogspot.com/2020/07/spectral-path-tracer.html">previous spectral path tracing post</a> is using 3 different cosh curves (with peaks at <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Color_matching_functions">the XYZ color matching functions</a>) to importance sample the visible wavelength instead of 1. So I decided to revise it and find another PDF for taking random wavelength samples.
A demo with the updated importance sampled wavelength can be downloaded <a href="https://drive.google.com/file/d/1RjKkKl8JbrU4V1aVc0BbD6CdjJ0ECzWZ/view?usp=sharing">here</a>, and the python code used for generating the PDF can be viewed <a href="https://colab.research.google.com/drive/1nwkqQNqtO2SeMDgY2W44N4WQXyvLjrT7?usp=sharing">here</a> (inspired by <a href="https://twitter.com/BartWronsk/status/1365171200188510209">Bart Wronski's tweet</a> to use Colab).</p><p></p><span style="font-size: large;"><b>First Failed Try</b></span><p></p><p>The basic idea is to create a function with peak values at the same locations as <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Color_matching_functions">the color matching functions</a>. I decided to use the sum of the XYZ color matching curves as the PDF.<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ90T1CA7-4JTiEG_Y3DkFpye-ii7eaFTxMuwSiP1WQ-zIgderbR5Hc0hwada_vyOz9PW6nQOIVcK9dTFmc3eDiKQ5-VSuuILooBgsUNAGlSc28psRkYiULyByMFraKt1h-YmE1zONf-Zt/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="247" data-original-width="379" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ90T1CA7-4JTiEG_Y3DkFpye-ii7eaFTxMuwSiP1WQ-zIgderbR5Hc0hwada_vyOz9PW6nQOIVcK9dTFmc3eDiKQ5-VSuuILooBgsUNAGlSc28psRkYiULyByMFraKt1h-YmE1zONf-Zt/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Black line is the sum of the XYZ curves<br /></td></tr></tbody></table><p></p><p>To simplify calculation, an analytical approximation of the XYZ curves can be used. A common approximation can be found <a href="http://jcgt.org/published/0002/02/01/paper.pdf">here</a>, but it seems hard to integrate (due to the λ squared term) to create the CDF. So a rational function is used instead:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqzlB44ip9U88QvdFyJymlkTVYyIFjFFFOWHA9GaeNoZkvTQ062UwvC9RSHmYffxS4soJwoAlQ79r2qFGYgXsNL1VjeThbHzgZZHBsEHpgessY9PUMK61Ja4DqXio2gwy1P4rmyZdFsmdM/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="185" data-original-width="594" height="63" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqzlB44ip9U88QvdFyJymlkTVYyIFjFFFOWHA9GaeNoZkvTQ062UwvC9RSHmYffxS4soJwoAlQ79r2qFGYgXsNL1VjeThbHzgZZHBsEHpgessY9PUMK61Ja4DqXio2gwy1P4rmyZdFsmdM/w200-h63/formula.png" width="200" /></a></div><p></p><p>The X curve is approximated with 1 peak, dropping the small peak at around 450nm; because we want to compute the "sum of XYZ" curve, that missing peak can be compensated by scaling the Z curve. The approximated curves look like:</p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvYEkp_E40KOU5mWBwjyeiQI9BqqiIMYsCbZeNWGKsvcIehFjnX_SCK2MWtFf2sro6_8fvFPws2SjCdmPwpg1JEwE_G4tyG7C1xu-ldYtdlOtaSs_RtT4uXllAmq40DIBo4TNgt5k392ZH/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="248" data-original-width="381" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvYEkp_E40KOU5mWBwjyeiQI9BqqiIMYsCbZeNWGKsvcIehFjnX_SCK2MWtFf2sro6_8fvFPws2SjCdmPwpg1JEwE_G4tyG7C1xu-ldYtdlOtaSs_RtT4uXllAmq40DIBo4TNgt5k392ZH/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Light colored curves are the approximated function<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wwd_hauOuyvWRFqpXyCKAzh7Wqi7REAKxQsrCPdV68caM9A7iCLf2KkyKcTSy3O6HQ07ydiCBKhu3AN9UeE_nDs6l8K2U5WnS1WbaLZfcziBcHT5YhIcHTsJloqvao8wjtOXXJPQowWv/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="250" data-original-width="393" height="204" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wwd_hauOuyvWRFqpXyCKAzh7Wqi7REAKxQsrCPdV68caM9A7iCLf2KkyKcTSy3O6HQ07ydiCBKhu3AN9UeE_nDs6l8K2U5WnS1WbaLZfcziBcHT5YhIcHTsJloqvao8wjtOXXJPQowWv/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Grey color curve is the approximated PDF<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table>The approximated PDF is roughly similar to the sum of the XYZ color matching curves. But I made a mistake: although the rational function can be integrated to create the CDF, I don't know how to compute the inverse of the CDF (which is needed by the inverse method to draw random samples using uniform random numbers). So I need to find another way...<br /><p><b><span style="font-size: large;">Second Try</span></b></p><p>Although I don't know how to find the inverse of the approximated CDF from the previous section, out of curiosity I still wanted to know how that CDF looks, so I plotted the graph:<br /></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiruU-hs0sM0S-hlGWJNLU0dCzci04Rv1AsVaAwnAQwsOzrLbG7WsZ7ppeTJ_-KGBSMD0bdVskoaWEPozqckRWWE2BbeRuXXaqOsxy2vxjcaA3dsizyoY5xE196llFIjzJEg267pb465Sme/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="248" data-original-width="380" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiruU-hs0sM0S-hlGWJNLU0dCzci04Rv1AsVaAwnAQwsOzrLbG7WsZ7ppeTJ_-KGBSMD0bdVskoaWEPozqckRWWE2BbeRuXXaqOsxy2vxjcaA3dsizyoY5xE196llFIjzJEg267pb465Sme/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Black line is the original CDF<br />Grey line is the approximated CDF</td></tr></tbody></table><p>It looks like several smoothstep functions added together: 1 base smoothstep curve over the range [380, 780] with 2 smaller smoothstep curves (over roughly [400, 500] and [500, 650]) added on top of the base curve. Maybe I can approximate this CDF with some kind of polynomial function. After some trial and error, an approximated CDF is found (details of the CDF and PDF can be found in the <a href="https://colab.research.google.com/drive/1nwkqQNqtO2SeMDgY2W44N4WQXyvLjrT7?usp=sharing">python code</a>):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgft-sHufmtt5T79Fd-bufdXYgqBTj-ebJD6HHXfWet8G7dhoS6xWbdZfcCG8n6R1oDZc8cwEcjv20J_o1szuDQt4xuzLOZauxkbGAU5Ve0M6EBG4ow5mQVHT1OQvl40-dIJDzTSolDhUrH/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="283" data-original-width="775" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgft-sHufmtt5T79Fd-bufdXYgqBTj-ebJD6HHXfWet8G7dhoS6xWbdZfcCG8n6R1oDZc8cwEcjv20J_o1szuDQt4xuzLOZauxkbGAU5Ve0M6EBG4ow5mQVHT1OQvl40-dIJDzTSolDhUrH/w640-h234/cdf_smoothStep_func.png" width="640" /></a></div><p></p><p>The above function divides the visible wavelength spectrum into 4 intervals to form a <a href="https://en.wikipedia.org/wiki/Piecewise">piecewise function</a>. Since smoothstep is a cubic function, adding smoothstep functions together still gives a cubic function, which can be inverted and differentiated (see the sketch after the figures below). The approximated "smoothstep CDF/PDF" curves look like:</p><p></p>
<table>
<tbody>
<tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5FWaf-OX5w42wFuqZgiLLUnP15WtSIyeryQzY50zi5p7GIsl6UDK5uzqkGETRV1vS9XtEtDPDoXeKyWdmppEZ66nz6ceB7SAG6seUafw1DZIn58yHFytJ490Yrxxy3AnizRW7nMGOLxKc/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="248" data-original-width="380" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5FWaf-OX5w42wFuqZgiLLUnP15WtSIyeryQzY50zi5p7GIsl6UDK5uzqkGETRV1vS9XtEtDPDoXeKyWdmppEZ66nz6ceB7SAG6seUafw1DZIn58yHFytJ490Yrxxy3AnizRW7nMGOLxKc/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Blue line is the approximated "smoothstep CDF"<br />Black line is the original CDF<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieqYk7_cYaRUMVbzvCBNnVPoKkL_oi3AEiCF-s9NDEpgTX_uSZDKq7Qt7dhU0odhDMqU-S9Y5xe8S8TTNVYQEMSdYg4po0D9a-loI75qy7ZW1WA_a4O3tHO7pZifOQJMhi8Mm6WYuWyAPG/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="250" data-original-width="392" height="204" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieqYk7_cYaRUMVbzvCBNnVPoKkL_oi3AEiCF-s9NDEpgTX_uSZDKq7Qt7dhU0odhDMqU-S9Y5xe8S8TTNVYQEMSdYg4po0D9a-loI75qy7ZW1WA_a4O3tHO7pZifOQJMhi8Mm6WYuWyAPG/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Blue line is the approximated "smoothstep PDF"<br />Black line is the original PDF</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
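<p>To illustrate the inverse-transform step with a single smoothstep segment: the cubic s(t) = 3t&#178; - 2t&#179; has a closed-form inverse, so drawing a sample only needs one uniform random number. This toy Python sketch uses just one smoothstep over the whole visible range; the actual CDF above is a 4-interval piecewise cubic, where each interval is inverted in the same spirit (a general cubic needs a cubic solver, see reference [3]):</p>
<pre>
import math

def inverse_smoothstep(y):
    # closed-form inverse of s(t) = 3t^2 - 2t^3 on [0, 1]
    return 0.5 - math.sin(math.asin(1.0 - 2.0 * y) / 3.0)

def sample_wavelength_toy(u, lo=380.0, hi=780.0):
    # u: uniform random number in [0, 1); toy single-smoothstep CDF,
    # the real sampler picks one of the 4 piecewise intervals first
    t = inverse_smoothstep(u)
    return lo + t * (hi - lo)

def pdf_toy(wavelength, lo=380.0, hi=780.0):
    # derivative of the smoothstep CDF, used as the Monte Carlo weight
    t = (wavelength - lo) / (hi - lo)
    return (6.0 * t - 6.0 * t * t) / (hi - lo)
</pre>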
Although the "smoothstep CDF" looks smooth, its PDF is not (not <a href="https://en.wikipedia.org/wiki/Smoothness">C1 continuous</a> at around 500nm). But it has 2 peaks value at around 450nm and 600nm, may be let's try to render some images to see how it behaves.<br /><p><span style="font-size: large;"><b>Result</b></span><br /></p><p>The same Sponza scene is rendered with 3 different wavelength sampling functions: uniform, cosh and smoothstep (with stratification of random number disabled, color noise are more noticeable when zoomed in):</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3aXyicpUFHFrLPp5gFMYKwqOS085Q7ah7v5M4Dnxz8c6NNdcAi500gdA3Ev-BpUOx2vEwv0LkHBnNBKDfnFM7U6e6Tab7l3VLD3e9grpjV2ml0JTFc6yFxII3Pmh0WZrEXe0AoirhCv4r/s16000/100_0_uniform.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3aXyicpUFHFrLPp5gFMYKwqOS085Q7ah7v5M4Dnxz8c6NNdcAi500gdA3Ev-BpUOx2vEwv0LkHBnNBKDfnFM7U6e6Tab7l3VLD3e9grpjV2ml0JTFc6yFxII3Pmh0WZrEXe0AoirhCv4r/w200-h106/100_0_uniform.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">uniform weighted sampling<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib7k88OSykvDmPfSHPA74BF_9JMfn1W-WEKY4ptTjZ4Cp44IT2v-zDJpZ7fLrKUVvddrPbF0e-vmSU7JSY4THtxuuhG_9gBaUUUu3-_VEZAPuvDjlaCrj9RD5YWExqvaLdQVksKO_Nj8Tt/s16000/100_0_cosh.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib7k88OSykvDmPfSHPA74BF_9JMfn1W-WEKY4ptTjZ4Cp44IT2v-zDJpZ7fLrKUVvddrPbF0e-vmSU7JSY4THtxuuhG_9gBaUUUu3-_VEZAPuvDjlaCrj9RD5YWExqvaLdQVksKO_Nj8Tt/w200-h106/100_0_cosh.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">cosh weighted sampling</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisMCZbgvubYNx0j-6-0N2nQzTd9CYMBR-jKg369Cvb8WgAZmhj6CiPKtx8fkdCu3RdjvZPDDuGRb-hMlKNSVchVw7PFgg15SeaUmh4XiRlmWrjjfrBw74g1U46INqOS5nxFbqV-oEt-ZrA/s16000/100_0_smoothstep.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="728" data-original-width="1368" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisMCZbgvubYNx0j-6-0N2nQzTd9CYMBR-jKg369Cvb8WgAZmhj6CiPKtx8fkdCu3RdjvZPDDuGRb-hMlKNSVchVw7PFgg15SeaUmh4XiRlmWrjjfrBw74g1U46INqOS5nxFbqV-oEt-ZrA/w200-h106/100_0_smoothstep.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">smoothstep weighted sampling</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
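<p>For context, here is a generic sketch of how a sampled wavelength and its PDF enter the spectral estimator (not the demo's exact code): the smaller the variation of the integrand divided by the PDF across samples, the less color noise. It reuses the hypothetical <b>sample_wavelength_toy()</b>/<b>pdf_toy()</b> helpers from the sketch above:</p>
<pre>
import random

def estimate_response(radiance, cmf, n_samples=1024):
    # Monte Carlo estimate of the integral of radiance(l) * cmf(l) over
    # the visible range, weighting each sample by 1/pdf as usual
    total = 0.0
    for _ in range(n_samples):
        l = sample_wavelength_toy(random.random())
        total += radiance(l) * cmf(l) / pdf_toy(l)
    return total / n_samples
</pre>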
<p>Both the cosh and smoothstep wavelength sampling methods show less color noise than the uniform sampling method, with the smoothstep PDF slightly better than the cosh function. It seems the C1 discontinuity of the PDF does not affect rendering very much. A demo can be downloaded <a href="https://drive.google.com/file/d/1RjKkKl8JbrU4V1aVc0BbD6CdjJ0ECzWZ/view?usp=sharing">here</a> to see how it looks in real-time.</p><p><span style="font-size: large;"><b>Conclusion</b></span></p><p>This post described an approximated function to importance sample the visible wavelength using the sum of the color matching functions, which reduces the color noise slightly. The approximated CDF is composed of cubic piecewise functions. The python code used for generating the polynomial coefficients can be found <a href="https://colab.research.google.com/drive/1nwkqQNqtO2SeMDgY2W44N4WQXyvLjrT7?usp=sharing">here</a> (with some unused testing code too, e.g. I tried using 1 linear base function with 2 smaller smoothstep functions added on top, but the result was not much better...). Although the approximated PDF is not C1 continuous, it does not affect the rendering very much. If someone knows more about how the C1 continuity of the PDF affects rendering, please leave a comment below. Thank you. </p><p><span style="font-size: medium;"><b>References</b></span></p><p>[1] <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Color_matching_functions">https://en.wikipedia.org/wiki/CIE_1931_color_space#Color_matching_functions</a></p><p>[2] <a href="http://jcgt.org/published/0002/02/01/paper.pdf">Simple Analytic Approximations to the CIE XYZ Color Matching Functions</a><br /></p>[3] <a href="https://stackoverflow.com/questions/13328676/c-solving-cubic-equations">https://stackoverflow.com/questions/13328676/c-solving-cubic-equations</a><br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-91874485747843693132020-09-30T01:47:00.000+08:002020-09-30T01:47:10.072+08:00sRGB/ACEScg Luminance Comparison<p><span style="font-size: large;"><b>Introduction</b></span></p><p>When I was searching for information about rendering in different color spaces, I came across the claim that <a href="https://chrisbrejon.com/cg-cinematography/chapter-1-5-academy-color-encoding-system-aces/">using wider color primaries</a> (e.g. ACEScg instead of sRGB/Rec709) to perform the lighting calculation gives a result closer to spectral rendering. But will this affect the brightness of the bounced light? I decided to find out. (The math is a bit lengthy, please feel free to skip to the result.)<br /></p><p> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR4PzrNEos67RkKhhCJYKsrTt3KQXBP_Yo3taHW8EwtcZSb35ALg3TR5Jlro_iN62wB_yWrLwfWxSMBa9V7Wbat-ylcvA-jlg_tH-xZfrcvZKXS-gBakDCqkNuAG6xUS5Z4UjfOs84tYF7/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1194" data-original-width="2048" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR4PzrNEos67RkKhhCJYKsrTt3KQXBP_Yo3taHW8EwtcZSb35ALg3TR5Jlro_iN62wB_yWrLwfWxSMBa9V7Wbat-ylcvA-jlg_tH-xZfrcvZKXS-gBakDCqkNuAG6xUS5Z4UjfOs84tYF7/w640-h374/lum_sRGB.png" width="640" /></a></p><p></p><p><b><span style="font-size: large;">Comparison method</span></b></p><p>To predict the brightness of the rendered image, we can consider the reflected light color after <b><i>n</i></b> bounces. To simplify the problem, we assume all the surfaces are diffuse.
We can derive a formula for the RGB color vector <b><i>c</i></b> after <b><i>n</i></b> light bounces.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3Q_8kCjRyIGxDWHmD6mCyFcX_gU7gwmTlQoXdekXTW7UaRGH-bDs0XdAJRg-xUUHvQ6ctuZq8xkLTvcJasv7bO-VxlKTVynMNI59do8xFR7ZHnfd7WrMcsmvErbKzt6Oa1byPt3wTcJyt/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="660" data-original-width="1424" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3Q_8kCjRyIGxDWHmD6mCyFcX_gU7gwmTlQoXdekXTW7UaRGH-bDs0XdAJRg-xUUHvQ6ctuZq8xkLTvcJasv7bO-VxlKTVynMNI59do8xFR7ZHnfd7WrMcsmvErbKzt6Oa1byPt3wTcJyt/w400-h185/diffuse.png" width="400" /></a></div><p></p><p>To calculate lighting in different color spaces, we need to convert the <i>albedo</i> and <i>initial light color</i> to our desired color gamut by multiplying with a matrix <b><i>M</i></b> (for rendering in sRGB/Rec709, <b><i>M</i></b> is the identity matrix).</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3LAuFg2s-6sc-yLGGfMI5OzeMR1A1exHvJ4OzGnCCS3sirxabl_OfLuMXrtjdWtssaiL5cNjYeRenligiKpqguHvHm08X-Z01INpcg0S1LCBfH8b6aM20bNA8UaSNOhwJ1IuGr3qHP-nD/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="232" data-original-width="1669" height="55" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3LAuFg2s-6sc-yLGGfMI5OzeMR1A1exHvJ4OzGnCCS3sirxabl_OfLuMXrtjdWtssaiL5cNjYeRenligiKpqguHvHm08X-Z01INpcg0S1LCBfH8b6aM20bNA8UaSNOhwJ1IuGr3qHP-nD/w400-h55/diffuse_with_gamut_convert.png" width="400" /></a></div><p></p><p>Finally, we can calculate the luminance of the bounced light by computing the dot product between the color vector <b><i>c</i></b> and the luminance vector <b><i>Y</i></b> of the color space (i.e.
<b><i>Y</i></b> is the second row vector of the conversion matrix from a color space to XYZ space, with chromatic adaptation).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQUKlIJKw64aD5M3RtTbm44q9P-bgTK414zqHbqvTVbkI7E9dcW0QwRr5vcVEk9hU4xv3QNi8XKFh8JycWV2ZCwOIv9kvc3T96Ii9Eh_ZMVZx_iciPOLEINtbOsD5H9_qcEdRMX3x8BMhr/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="707" data-original-width="1375" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQUKlIJKw64aD5M3RtTbm44q9P-bgTK414zqHbqvTVbkI7E9dcW0QwRr5vcVEk9hU4xv3QNi8XKFh8JycWV2ZCwOIv9kvc3T96Ii9Eh_ZMVZx_iciPOLEINtbOsD5H9_qcEdRMX3x8BMhr/w400-h206/luminance.png" width="400" /></a></div><p>Now we have an equation to compute the brightness of a reflected ray after <b><i>n</i></b> bounces in an arbitrary color space.<br /></p><p></p><br /><p></p><p><span style="font-size: large;"><b>Grey Material Test</b></span></p><p>To further simplify the problem, we assume all the surfaces use the same material:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip-Qyhz2TavxCbedplj9RGWB4P8kaHyVnKzxB7hNDD0h-1xXW9pA8wFosb2gA_DUnuNV5p34ruUSJFjd_rr2x8tRZXnvhovt-kR8KHjXd9q0_j_kLkZbb_RWr8NsB_oBaetV5Y48vN7FVQ/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="404" data-original-width="1220" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip-Qyhz2TavxCbedplj9RGWB4P8kaHyVnKzxB7hNDD0h-1xXW9pA8wFosb2gA_DUnuNV5p34ruUSJFjd_rr2x8tRZXnvhovt-kR8KHjXd9q0_j_kLkZbb_RWr8NsB_oBaetV5Y48vN7FVQ/w400-h133/lum_all_d_equal.png" width="400" /></a></div><p></p><p></p><p>Then, assuming all the surfaces are grey in color:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCpnJuVNGBQOULRoOQYZwJ5Zr8pFfQkWWwrBbeaXk86lBMZ1PxK0w5z5adZ1k4Wg0OI_8IwjuDeW-vryHfarpnnRNMOsw-7_Nvm80VZSyO3za9ClIrq-PtSWJKu4QD9BIoDFgPZO2zqG6E/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="371" data-original-width="1172" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCpnJuVNGBQOULRoOQYZwJ5Zr8pFfQkWWwrBbeaXk86lBMZ1PxK0w5z5adZ1k4Wg0OI_8IwjuDeW-vryHfarpnnRNMOsw-7_Nvm80VZSyO3za9ClIrq-PtSWJKu4QD9BIoDFgPZO2zqG6E/w400-h126/lum_grey.png" width="400" /></a></div><p>Now the luminance equation is simpler to understand.</p><p>Substituting the matrix <b><i>M</i></b> and luminance vector <b><i>Y</i></b> for the sRGB color gamut, the equation is very simple:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoru9NPu8XCB8DaEKuiD-a2KEyImHsnHZ_tcZcTCPS94LNI4_BSjpi5ne96Mz21glkYLiKeYgQ5GRmLjXifRBTHfgVErqEEPGEYvBzNmNIRsFJfygfjBf2kZs4hMS7aRXgE-CH2zOVMVw0/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="343" data-original-width="1221" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoru9NPu8XCB8DaEKuiD-a2KEyImHsnHZ_tcZcTCPS94LNI4_BSjpi5ne96Mz21glkYLiKeYgQ5GRmLjXifRBTHfgVErqEEPGEYvBzNmNIRsFJfygfjBf2kZs4hMS7aRXgE-CH2zOVMVw0/w400-h113/lum_grey_sRGB.png" width="400" /></a></div><p></p><p>Then we do the same thing for ACEScg. 
Surprisingly, some of the constants are roughly equal to one, so we can approximate them with one, and the result is then roughly equal to the luminance equation of sRGB.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCJp5qN5JXTKeBMo_K-Kyq9FM6dNgNzC3WBO6rXNw3SccrXaxhldyS-z4zOePNIg5rZkZWxeX9ZG-Sei3vKpWCDv-DSJfbuiTvNIPsBiPovmqveTokDSDrsQu8ER9ck3gkCQd5NOlEAqko/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1361" data-original-width="2048" height="426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCJp5qN5JXTKeBMo_K-Kyq9FM6dNgNzC3WBO6rXNw3SccrXaxhldyS-z4zOePNIg5rZkZWxeX9ZG-Sei3vKpWCDv-DSJfbuiTvNIPsBiPovmqveTokDSDrsQu8ER9ck3gkCQd5NOlEAqko/w640-h426/lum_grey_ACEScg.png" width="640" /></a></div><p>As both equations are roughly equal, the rendered images in sRGB and ACEScg should be similar. Let's try to render images in sRGB and ACEScg to see the result (images are path traced with sRGB and ACEScg primaries, and then displayed in sRGB).
</p><p></p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR4PzrNEos67RkKhhCJYKsrTt3KQXBP_Yo3taHW8EwtcZSb35ALg3TR5Jlro_iN62wB_yWrLwfWxSMBa9V7Wbat-ylcvA-jlg_tH-xZfrcvZKXS-gBakDCqkNuAG6xUS5Z4UjfOs84tYF7/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR4PzrNEos67RkKhhCJYKsrTt3KQXBP_Yo3taHW8EwtcZSb35ALg3TR5Jlro_iN62wB_yWrLwfWxSMBa9V7Wbat-ylcvA-jlg_tH-xZfrcvZKXS-gBakDCqkNuAG6xUS5Z4UjfOs84tYF7/w400-h234/lum_sRGB.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in sRGB</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTgzMUnTPCn9zvrmrPoiwMiqjpweOSBL50PQym8kBF3kbTi-DfaK1s-9pMcPslhV9D4-_Qdv7I31OOH2zyn3GcmyKvRKk8DTfnnsAqgPnykHTZEJwYo3iuhLChCxBFzNMPsS_P3w-Lz9jE/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTgzMUnTPCn9zvrmrPoiwMiqjpweOSBL50PQym8kBF3kbTi-DfaK1s-9pMcPslhV9D4-_Qdv7I31OOH2zyn3GcmyKvRKk8DTfnnsAqgPnykHTZEJwYo3iuhLChCxBFzNMPsS_P3w-Lz9jE/w400-h234/lum_ACEScg.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in ACEScg<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table><br /><div>Both images look very similar! So rendering in different color spaces with a grey material will not change the brightness of the image. At least, the difference is very small after tone mapping to a displayable range.<p><br /></p><p><span style="font-size: large;"><b>Red Material Test</b></span></p><p>Now, let's try using a red material instead of a grey material to see how the luminance changes (where <b><i>k</i></b> is a variable that controls how 'red' the material is):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipTFU3W5hrLwqqJcpbO0plI11jSoUeRmbgIc35VEByxYUGK2NuVLa2bftn8rwO4qjnADOal0wddxEdJodJOY9ouJbnbduUb_Ya5p-j07xSnkZsssxJQ1JF5HIRJaoi8qRPOxVmDN5fNOpI/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="372" data-original-width="1191" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipTFU3W5hrLwqqJcpbO0plI11jSoUeRmbgIc35VEByxYUGK2NuVLa2bftn8rwO4qjnADOal0wddxEdJodJOY9ouJbnbduUb_Ya5p-j07xSnkZsssxJQ1JF5HIRJaoi8qRPOxVmDN5fNOpI/w400-h125/lum_red.png" width="400" /></a></div><p></p><p>But the equation is still a bit complex, so we further assume the initial light color is white.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIBXtuSvSd6KiMjeAfrjSnUNmGloyuaHzVxWd2KcS-dvT-DiWt_33C3gbq-NoGHDRscI-l3PVZF59cAO6cMCriocHR3-hn7E3QBhDF2cTDUVl_xzaVjCFql68pgNR-jTKZxFUVr_MX_2fB/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="376" data-original-width="1110" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIBXtuSvSd6KiMjeAfrjSnUNmGloyuaHzVxWd2KcS-dvT-DiWt_33C3gbq-NoGHDRscI-l3PVZF59cAO6cMCriocHR3-hn7E3QBhDF2cTDUVl_xzaVjCFql68pgNR-jTKZxFUVr_MX_2fB/w400-h135/lum_red_white_light.png" width="400" /></a></div>Then we perform the same steps as in the last section, substituting <b><i>M</i></b> and <b><i>Y</i></b> into the luminance equation.
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOvNK25b72YZ2jT2h8D_seEZH_ODIcXFFEDRidEFqQIsnUVmoHeSVKxHxC5k6Rc9OQKy3-cmtYiJtyV5FLZIHCQen6QrpiZPbMzHE8JFN6l3QNM1-fxMSrhfkzqLD0tLiIV_8hsE57S9HQ/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="366" data-original-width="980" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOvNK25b72YZ2jT2h8D_seEZH_ODIcXFFEDRidEFqQIsnUVmoHeSVKxHxC5k6Rc9OQKy3-cmtYiJtyV5FLZIHCQen6QrpiZPbMzHE8JFN6l3QNM1-fxMSrhfkzqLD0tLiIV_8hsE57S9HQ/w320-h120/lum_red_sRGB.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">sRGB luminance equation</td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0aQ-stKeSKwDicZ1s0lC08918ZzvAmG7pagpJmP1LQYTJZX2PSjUOaM0CnYWdj8URykxQcdP9Yb0Ew8518Yc1FPyQXwEkKHfDye5L_zdLjth2Dqxct8RnXAD78NO0PNv963htDino9mS2/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="299" data-original-width="1398" height="85" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0aQ-stKeSKwDicZ1s0lC08918ZzvAmG7pagpJmP1LQYTJZX2PSjUOaM0CnYWdj8URykxQcdP9Yb0Ew8518Yc1FPyQXwEkKHfDye5L_zdLjth2Dqxct8RnXAD78NO0PNv963htDino9mS2/w400-h85/lum_red_ACEScg.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><br /><br />ACEScg luminance equation</td></tr></tbody></table>
</td>
</tr>
</tbody></table>
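<p>These equations can also be checked numerically. Below is a minimal Python sketch of the whole comparison (my own cross-check, not code from the demo): it uses the standard sRGB and ACEScg/AP1 RGB-to-XYZ matrices, skips the chromatic adaptation step for brevity, and parameterizes the red albedo as (0.5+<b><i>k</i></b>, 0.5-<b><i>k</i></b>, 0.5-<b><i>k</i></b>), which is an assumption for illustration rather than the exact form in the equations above:</p>
<pre>import numpy as np

# Standard RGB -> XYZ matrices: sRGB/Rec709 (D65) and ACEScg/AP1 (D60).
SRGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])
AP1_TO_XYZ = np.array([[ 0.6624542, 0.1340042, 0.1561877],
                       [ 0.2722287, 0.6740818, 0.0536895],
                       [-0.0055746, 0.0040607, 1.0103391]])

def bounce_luminance(albedo_srgb, light_srgb, n, rgb_to_xyz):
    # Convert albedo and light into the working space (the matrix M in the
    # equations above), multiply element-wise n times, then dot with the
    # luminance vector Y, i.e. the 2nd row of the space's RGB -> XYZ matrix.
    to_ws = np.linalg.inv(rgb_to_xyz) @ SRGB_TO_XYZ
    a, l = to_ws @ albedo_srgb, to_ws @ light_srgb
    return rgb_to_xyz[1] @ (a**n * l)

k = 0.4                                         # how 'red' the material is
albedo = np.array([0.5 + k, 0.5 - k, 0.5 - k])  # assumed parameterization
light = np.array([1.0, 1.0, 1.0])               # white initial light
for n in (3, 5):
    print(n, bounce_luminance(albedo, light, n, SRGB_TO_XYZ),
             bounce_luminance(albedo, light, n, AP1_TO_XYZ))
</pre>
<p>For a saturated red (<b><i>k</i></b> close to 0.5), the sRGB-space luminance comes out larger, consistent with the graphs below.</p>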
<p></p><p>Unfortunately, both equations are a bit too complex to compare symbolically, having the 2 variables <b><i>k</i></b> and <b><i>n</i></b>... Maybe we can plot some graphs to see how those variables affect the luminance, with the number of light bounces = 3 and 5 (i.e. <i><b>n</b></i>=3 and <b><i>n</i></b>=5, skipping the <b><i>N dot L</i></b> part because both equations have such a term). From the graphs below: when <b><i>k</i></b> increases (i.e. the red color gets more saturated, with RGB values closer to (1, 0, 0)), the luminance difference increases, hence sRGB will have a larger luminance value than ACEScg.</p><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAeYOfmBtlZbczcu6c1wo1M5IrKr2DU_xzaBSu2uSXoW13HHx6Qpqq3IOGHpXHsOibG37Tn4o2t06dQCYNrxDkyjq95HR7_f-QYLF2FSwm_V415tB9DatgzPuWWc5-JnNA4pVVhQsKRWNR/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="940" data-original-width="3185" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAeYOfmBtlZbczcu6c1wo1M5IrKr2DU_xzaBSu2uSXoW13HHx6Qpqq3IOGHpXHsOibG37Tn4o2t06dQCYNrxDkyjq95HR7_f-QYLF2FSwm_V415tB9DatgzPuWWc5-JnNA4pVVhQsKRWNR/w640-h189/red_plot.png" width="640" /></a></div><br /><p></p><p>Then let's compare the images rendered in sRGB and ACEScg:</p><p></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifXcXJnVtHA_1ARyvak3Whaktu26NAK31j5NwBmbbOpu1vxVdZ3K7MKrsOr4Z40-hH3Hw-XYRc8dOiAP5Rik-bHdhocLeoRYRYRjFkLyg3s5f-ruv8x15AiwX7rmFkPRA45-qF_-BR_-wo/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifXcXJnVtHA_1ARyvak3Whaktu26NAK31j5NwBmbbOpu1vxVdZ3K7MKrsOr4Z40-hH3Hw-XYRc8dOiAP5Rik-bHdhocLeoRYRYRjFkLyg3s5f-ruv8x15AiwX7rmFkPRA45-qF_-BR_-wo/w400-h234/red_sRGB.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in sRGB<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkAdkQazqlXoBGXDMNcFF4nBwIFn6zubw6PXlDlNZAR2GIe5MDlxWEL9RIAJ2vxRRPycUHLQhQIRokyGCPp3nTvga9IqeGFp86UShebgXgi9woiMlh0lbdTbEFJP3xQAvuQUp5KKqQYVri/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkAdkQazqlXoBGXDMNcFF4nBwIFn6zubw6PXlDlNZAR2GIe5MDlxWEL9RIAJ2vxRRPycUHLQhQIRokyGCPp3nTvga9IqeGFp86UShebgXgi9woiMlh0lbdTbEFJP3xQAvuQUp5KKqQYVri/w400-h234/red_ACEScg.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in ACEScg<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table><p></p><p>The indirectly lit area looks much brighter when rendered in sRGB. This makes sense because, for any red color, its red channel value will be closer to one (while the green/blue values will be closer to 0) when represented in sRGB than when represented in ACEScg. After several multiplications, the reflected light value should be larger when the computation is done in sRGB.<br /></p><p> </p><p><span style="font-size: large;"><b>RGB Material Test</b></span></p><p>How about using differently colored materials this time? Assume 1/3 of the light bounces on surfaces with a red material, 1/3 on a green material, and 1/3 on a blue material. </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_h2U9I2OC8E7uFOtE3DSJp4gVTNGI0QQ4nRyXNvlCZcKQdSvT9AhOw8XRNs6BMlTvly7DFf7I66Oj1uCMXRHBNU1t-fkkfYzhl0ptOmgq4xaKPojlGKC-JA-KdYTD-y-HMvoXUpYMfXOH/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1144" data-original-width="1906" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_h2U9I2OC8E7uFOtE3DSJp4gVTNGI0QQ4nRyXNvlCZcKQdSvT9AhOw8XRNs6BMlTvly7DFf7I66Oj1uCMXRHBNU1t-fkkfYzhl0ptOmgq4xaKPojlGKC-JA-KdYTD-y-HMvoXUpYMfXOH/w400-h240/lum_rgb.png" width="400" /></a></div><p></p><p>As in the previous 2 sections, substituting <b><i>M</i></b> and <b><i>Y</i></b>, the luminance equation becomes:</p><p></p><p></p>
<table>
<tbody><tr>
<td>
<div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqXyjtEme1U1VNObkrh05m0ugaCaEa1EuNsfDGURzB33kmjlBawRBYZcIGJAmjdFFu0PyTB2D1UiZ-_u1Dy2LO-S0oY3R2W09bGdG0hWM2OzVBYOH-Xw4FXWWi3yrKhLvRphuhswzfj_jt/"><img alt="" data-original-height="350" data-original-width="877" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqXyjtEme1U1VNObkrh05m0ugaCaEa1EuNsfDGURzB33kmjlBawRBYZcIGJAmjdFFu0PyTB2D1UiZ-_u1Dy2LO-S0oY3R2W09bGdG0hWM2OzVBYOH-Xw4FXWWi3yrKhLvRphuhswzfj_jt/" width="320" /></a></div>
</td>
<td>
<div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxFnJWE9NJ-6KN5pU2OKFMQIW-rOcLcCyTL31s4QZkaWE7KENZDXRIUWUUcYVygUhRmfqGggi1BPAVyBA7kMeFGWc8Qud_ESs2BVJtboVIXkAydOMelcLOOAcNaMa-HZB-0lQZtMUF1iIs/"><img alt="" data-original-height="380" data-original-width="2546" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxFnJWE9NJ-6KN5pU2OKFMQIW-rOcLcCyTL31s4QZkaWE7KENZDXRIUWUUcYVygUhRmfqGggi1BPAVyBA7kMeFGWc8Qud_ESs2BVJtboVIXkAydOMelcLOOAcNaMa-HZB-0lQZtMUF1iIs/w640-h96/lum_rgb_ACEScg.png" width="640" /></a></div>
</td>
</tr>
<tr>
<td align="center">
sRGB luminance equation
</td>
<td align="center">
ACEScg luminance equation
</td>
</tr>
</tbody></table> </div><div>And then plotting graphs to see how the luminance varies with <b><i>k</i></b> and <b><i>n</i></b>:</div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVYLqog8lqChSuOO7aUEBsxtzStHzl4r0OoGEvU_xeC3j0vDW9X8kb29926FCGhHmj6GZaYCeHRqWpzcqfzVqc7WPLL1iRVGMZAVfSmls09crS1H0S0m63Gv94qEWcDggaqnleOTA_RBiR/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="913" data-original-width="3115" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVYLqog8lqChSuOO7aUEBsxtzStHzl4r0OoGEvU_xeC3j0vDW9X8kb29926FCGhHmj6GZaYCeHRqWpzcqfzVqc7WPLL1iRVGMZAVfSmls09crS1H0S0m63Gv94qEWcDggaqnleOTA_RBiR/w640-h188/rgb_plot.png" width="640" /></a></div><p></p><p>The result is different this time. The sRGB luminance is smaller than the ACEScg luminance, and the difference increases when both <i><b>k</b></i> and <i><b>n</b></i> increase. So the bounced light will be darker when rendered in sRGB.<br /></p><p>Let's try rendering some images to see whether this is true. Although we cannot force the rays to bounce off exactly 1/3 red/green/blue material, I roughly assigned 1/3 of the materials in the scene to red/green/blue.<br /></p>
<table>
<tbody><tr>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipI_gPmJ2EfB8m67uOSfSHRDne9c4NunWOPMtFH3jdD3H-4RmuxMNRJwa0oEzjpCjlqAz53M8QpZ-Bwd0ibMo9OOB3QvnJ1-CYCyNdVOKU2cRAM3o0fKSC4C0ZK2zJGKLemFjDTG32USF2/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipI_gPmJ2EfB8m67uOSfSHRDne9c4NunWOPMtFH3jdD3H-4RmuxMNRJwa0oEzjpCjlqAz53M8QpZ-Bwd0ibMo9OOB3QvnJ1-CYCyNdVOKU2cRAM3o0fKSC4C0ZK2zJGKLemFjDTG32USF2/w400-h234/color_sRGB.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in sRGB<br /></td></tr></tbody></table>
</td>
<td>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJnUEwQA28RrRvnYhFOjL1q2eY5cL6fsh1E41qtL67sdbRIYFTFBp3BnMCWkrj6x9ossoHJWa3Y6C-kI039VnAZpnFz_wlix30Hj2S0MAY5B97JaxkiS9Zi7OLsneYsv4HfOCuc7fj8ixr/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1194" data-original-width="2048" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJnUEwQA28RrRvnYhFOjL1q2eY5cL6fsh1E41qtL67sdbRIYFTFBp3BnMCWkrj6x9ossoHJWa3Y6C-kI039VnAZpnFz_wlix30Hj2S0MAY5B97JaxkiS9Zi7OLsneYsv4HfOCuc7fj8ixr/w400-h234/color_ACEScg.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Path traced in ACEScg<br /></td></tr></tbody></table>
</td>
</tr>
</tbody></table><br /><div>From the screenshots above, the indirectly lit red material looks darker when rendered in sRGB (especially the curtains on the ground floor), while the differences for the blue and green materials are small. We can reason about the result as in the previous section: for a given red color, when represented in sRGB its red channel value is closer to one, but its blue and green channel values are closer to 0 than when represented in ACEScg (the same holds for blue and green materials). So after several multiplications with differently colored materials, the RGB values in sRGB may become closer to 0 because the different material colors cancel each other out (e.g. when light bounces on red and green albedo surfaces (1, 0, 0) and (0, 1, 0) in sRGB, the reflected light will be zero, while with the same colors represented in ACEScg, the light will not be "zeroed out"), resulting in a darker image.<br /><p><br /></p><p><b><span style="font-size: large;">Conclusion</span></b></p><p>After testing with different assumptions, the brightness of images rendered in sRGB can be darker, roughly equal to, or brighter than images rendered in ACEScg. The brightness difference depends on the materials used in the scene. If the scene uses grey materials only, the brightness will be equal. If the materials have similar colors (e.g. all red materials), the sRGB image will be brighter. If the scene has more color variation, the sRGB image may become darker. It turns out this conclusion could have been reached without such lengthy math. We can think of the same color value represented in sRGB and ACEScg space: is the RGB value closer to 0 or 1 when represented in that color space? Will the RGB values 'cancel' each other out when performing lighting calculations in that color space? I was too slow to figure out this simple answer early on and instead worked through such lengthy math... >.<</p><p><br /></p><p><b>Reference</b></p><p><span style="font-size: x-small;">[1] <a href="https://chrisbrejon.com/cg-cinematography/chapter-1-5-academy-color-encoding-system-aces/">https://chrisbrejon.com/cg-cinematography/chapter-1-5-academy-color-encoding-system-aces/</a><br /></span></p><p><br /></p></div></div>Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-47040588541338864462020-07-11T17:51:00.000+08:002020-07-11T17:51:41.586+08:00Spectral Path Tracer<span style="font-size: large;"><b>Introduction</b></span><br />
Before starting this post, I would like to talk a bit about my homeland, Hong Kong. The Chinese government enacted a new <a href="https://www.bbc.com/news/world-asia-china-52765838">National Security Law</a>, bypassing our local legislative council. We could only read the <a href="https://hongkongfp.com/2020/07/01/in-full-english-translation-of-the-hong-kong-national-security-law/">full text of this law</a> after it was enacted (<span style="font-size: x-small;">with the official English version published 3 days after that</span>). This law destroys our legal system completely: the government can appoint judges they like (<span style="font-size: xx-small;">Article 44</span>), the jury can be removed from a trial (<span style="font-size: xx-small;">Article 46</span>), and trials can be held without media and public presence (<span style="font-size: xx-small;">Article 41</span>). This law is so vague that the government can prosecute anyone they don't like. People were arrested for possessing <a href="https://time.com/5862683/hong-kong-revolution-protest-chant-security-law/">anti-government stickers</a>. We don't even have the right to hate the government (<span style="font-size: xx-small;">Article 29.5</span>). If I promote "Boycott Chinese Products", I may have broken this law already... Also, the personnel of the security office do not need to obey HK law (<span style="font-size: xx-small;">Article 60</span>). This law even applies to foreigners outside HK (<span style="font-size: xx-small;">Article 38</span>). Our voting rights are also deteriorating: more pro-democracy candidates can be disqualified by this law in the upcoming election (<span style="font-size: xx-small;">Article 35</span>)... So, if you are living in a democratic country, please cast a vote if you can.<br />
<br />
Back to the topic of the spectral path tracer. Path tracing in the spectral domain has been added to my toy renderer (alongside tracing in sRGB / ACEScg space). A spectral path tracer traces rays with actual wavelengths of light instead of RGB channels. The result is physically correct, and some effects can only be calculated by spectral rendering (e.g. dispersion, iridescence). Although my toy spectral path tracer does not support such materials, I would like to investigate how spectral rendering affects the bounced light color compared to images rendered in RGB color spaces. The demo can be downloaded <b><a href="https://drive.google.com/file/d/1WFzAl6BPMsgFHUV3_XOC2qLhmKJrZeRv/view?usp=sharing">here</a></b>.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjO1k1bPYu_9KkNUsi-fEo8mUA4dF1nFcVo5BqkiPazEEpePte66V49NlR3aHG1H3QVq2_a-lFRPjxdxI88wVlLsejeUbmhyh8CC6pp1wgkAlTiWXK4VE_H7pwEh6t6y7oGPUWMJPVzMwnN/s1600/spectral_path_trace.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="371" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjO1k1bPYu_9KkNUsi-fEo8mUA4dF1nFcVo5BqkiPazEEpePte66V49NlR3aHG1H3QVq2_a-lFRPjxdxI88wVlLsejeUbmhyh8CC6pp1wgkAlTiWXK4VE_H7pwEh6t6y7oGPUWMJPVzMwnN/s640/spectral_path_trace.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Spectral rendered image</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Render Loop Modification</span></b><br />
Referencing my previous <a href="https://simonstechblog.blogspot.com/2020/03/dxr-path-tracer.html">DXR Path Tracer post</a>, only a few modifications are needed to support spectral path tracing:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVCVvWMyDF_6-t7PhC6QOLnl3nnENO9e4zwCX2wDWkV_sTKttVdLs_yoSK2fhJszvJRta1YrMofgteOy4iSBUjvTueq3ho2KeWF0LabwaPgpTkgo9SmvXhWxOiJcczvO6qoR2E30NgXu0j/s1600/flowchart.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="761" data-original-width="1600" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVCVvWMyDF_6-t7PhC6QOLnl3nnENO9e4zwCX2wDWkV_sTKttVdLs_yoSK2fhJszvJRta1YrMofgteOy4iSBUjvTueq3ho2KeWF0LabwaPgpTkgo9SmvXhWxOiJcczvO6qoR2E30NgXu0j/s640/flowchart.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">RGB path tracer render loop in previous post</td></tr>
</tbody></table>
When starting to trace new rays, a wavelength is randomly picked. My first implementation uses <a href="https://cgg.mff.cuni.cz/~wilkie/Website/EGSR_14_files/WNDWH14HWSS.pdf">hero wavelength</a> sampling with 3 wavelength samples per ray. The number 3 is chosen because it makes it convenient to replace the existing code where rays are traced with RGB channels. So the "Light Path Texture" from the previous post is modified to accumulate the energy at those 3 wavelengths during ray traversal. Finally, when the ray is terminated, the resulting energy in the "Light Path Texture" is integrated against the <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#Color_matching_functions">CIE XYZ color matching functions</a> and stored in the "Lighting Sample Texture" in XYZ space, which is later converted to the display device space (e.g. sRGB/AdobeRGB/Rec2020) as described in the previous post.<br />
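To make this concrete, here is a minimal Python sketch of the spectral bookkeeping described above (an illustration only, not the demo's actual GPU code; the color matching functions use the multi-lobe piecewise-Gaussian fits from reference [5], so the constants below come from that paper):<br />
<pre>import math, random

def hero_wavelengths(n=3, lo=380.0, hi=780.0):
    # Hero wavelength sampling: 1 random wavelength, plus n-1 rotations
    # evenly separated within the visible range.
    span = hi - lo
    hero = lo + random.random() * span
    return [lo + ((hero - lo) + i * span / n) % span for i in range(n)]

def _lobe(x, alpha, mu, s1, s2):
    # Piecewise Gaussian with a different width on each side of the peak.
    s = s1 if x < mu else s2
    return alpha * math.exp(-0.5 * ((x - mu) / s) ** 2)

def cie_xyz(lam):
    # Multi-lobe fits of the CIE 1931 standard observer (reference [5]).
    x = (_lobe(lam, 1.056, 599.8, 37.9, 31.0) + _lobe(lam, 0.362, 442.0, 16.0, 26.7)
         + _lobe(lam, -0.065, 501.1, 20.4, 26.2))
    y = _lobe(lam, 0.821, 568.8, 46.9, 40.5) + _lobe(lam, 0.286, 530.9, 16.3, 31.1)
    z = _lobe(lam, 1.217, 437.0, 11.8, 36.0) + _lobe(lam, 0.681, 459.0, 26.0, 13.8)
    return x, y, z

def splat_xyz(energy, lam, pdf):
    # Monte Carlo contribution of one wavelength sample to the XYZ pixel.
    x, y, z = cie_xyz(lam)
    return energy * x / pdf, energy * y / pdf, energy * z / pdf
</pre>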
<br />
<span style="font-size: large;"><b>Spectral Up Sampling Texture</b></span><br />
One of the problems in spectral rendering is converting textures from color to a <a href="https://en.wikipedia.org/wiki/Spectral_power_distribution">spectral power distribution (SPD)</a>; this process is called spectral up-sampling. Luckily, there are many papers about it. The technique called <a href="https://graphics.geometrian.com/research/spectral-primaries.html">"Spectral Primary Decomposition for Rendering with sRGB Reflectance"</a> is used in the demo to up-sample textures. I chose this method because of its simplicity. It reconstructs the spectrum using a linear combination of the texture color with 3 pre-computed spectral basis functions:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzV1f0SrIzL7riNYXwQad2_8XaikdqgKJndO6JrC9m2L3dCpgigvciu-rZZCZA4awWVN43XTUoD8NMhletKGYQIoTMZnOgmOug1zk0FKe7EAFwxgPgephO22tO8lo2N6nzITpNeZGt2o3Y/s1600/spec_pri.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzV1f0SrIzL7riNYXwQad2_8XaikdqgKJndO6JrC9m2L3dCpgigvciu-rZZCZA4awWVN43XTUoD8NMhletKGYQIoTMZnOgmOug1zk0FKe7EAFwxgPgephO22tO8lo2N6nzITpNeZGt2o3Y/s400/spec_pri.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"></td></tr>
</tbody></table>
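In code, this reconstruction is just a per-wavelength linear combination; a tiny sketch, where rho_r/rho_g/rho_b stand for the paper's precomputed basis functions (in practice these would be tabulated lookups, not included here):<br />
<pre>def upsample_reflectance(rgb, lam, rho_r, rho_g, rho_b):
    # S(lam) = r*rho_r(lam) + g*rho_g(lam) + b*rho_b(lam), where rgb is the
    # linear sRGB reflectance fetched from the texture.
    r, g, b = rgb
    return r * rho_r(lam) + g * rho_g(lam) + b * rho_b(lam)
</pre>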
But one thing that bothered me is that the meaning of the texture color is a bit different from what the spectral up-sampling method assumes. In PBR rendering, the texture color refers to <a href="https://en.wikipedia.org/wiki/Albedo">albedo</a> (i.e. the ratio of radiosity to the irradiance received by a surface), which is independent of the CIE XYZ observer, while the up-sampling method minimizes a least squares problem for the texture color viewed under illuminant D65 with the CIE standard observer. Maybe the RGB albedo values are computed with an SPD and the XYZ observer functions? I have no idea and may investigate this in the future.<br />
<br />
<span style="font-size: large;"><b>Spectral Up Sampling Light Color and Intensity</b></span><br />
Besides spectral up-sampling the textures, lights also need to be up-sampled. Because the light color can be specified in a wide gamut in the demo, the up-sampling method used in the above section is not enough. The method from <a href="https://rgl.s3.eu-central-1.amazonaws.com/media/papers/Jakob2019Spectral_3.pdf">"A Low-Dimensional Function Space for Efficient Spectral Upsampling"</a> is used to up-sample the light color. This method computes 3 coefficients from the light color (i.e. from RGB to c<span style="font-size: xx-small;">0</span>, c<span style="font-size: xx-small;">1</span>, c<span style="font-size: xx-small;">2</span>), and then the spectral power distribution, <i>f</i>(λ), can be computed as below:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7okprlXYov22uAaNNbHhXmvMVDWejDPGjukNwMiTe5sxyLU77RAJHZppnmjKgT17bVRu3NdVDfecuQHhKwrCv2dIimjIW_BTc0HiaI6pQ_-okAF_F2vBmztbFN2GpUUj6H-iXI8J1s-QY/s1600/light_upsample_func.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7okprlXYov22uAaNNbHhXmvMVDWejDPGjukNwMiTe5sxyLU77RAJHZppnmjKgT17bVRu3NdVDfecuQHhKwrCv2dIimjIW_BTc0HiaI6pQ_-okAF_F2vBmztbFN2GpUUj6H-iXI8J1s-QY/s320/light_upsample_func.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"></td></tr>
</tbody></table>
Since light is specified by color and intensity, after calculating the SPD coefficients we need to scale the SPD curve so that integrating the scaled SPD against the CIE standard observer ȳ(λ) curve equals the specified luminance intensity:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoTha8zbHfSUO5OaRTOaT_H_fjT3XG1egXdJSM_8zgsbATZx-OsJkMeqIqkTFDaqCzCZNCwGjXvorDsgf-7AqMAgGRx2-5lCFbnaBuz0PpWY9BKoKDOKHOC6tjuHrOC32TvlokJBuecTTQ/s1600/scale_luma.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="72" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoTha8zbHfSUO5OaRTOaT_H_fjT3XG1egXdJSM_8zgsbATZx-OsJkMeqIqkTFDaqCzCZNCwGjXvorDsgf-7AqMAgGRx2-5lCFbnaBuz0PpWY9BKoKDOKHOC6tjuHrOC32TvlokJBuecTTQ/s400/scale_luma.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"></td></tr>
</tbody></table>
The scaling factor K is calculated numerically using the <a href="https://en.wikipedia.org/wiki/Trapezoidal_rule">Trapezoidal rule</a> with a 1<span style="font-size: x-small;">nm</span> wavelength interval, and the ȳ(λ) curve is approximated with the <a href="http://jcgt.org/published/0002/02/01/paper.pdf">multi-lobe approximation in "Simple Analytic Approximations to the CIE XYZ Color Matching Functions"</a>. So the light spectral power distribution is specified by 4 floating point numbers: 3 coefficients + 1 intensity scale.<br />
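A small Python sketch of this scaling (my own illustration; the sigmoid spectrum below is the <i>f</i>(λ) form from the paper, and the wavelength units must match those used when fitting the coefficients):<br />
<pre>import math

def sigmoid_spectrum(lam, c0, c1, c2):
    # f(lam) = S(c0*lam^2 + c1*lam + c2), with S(x) = 1/2 + x/(2*sqrt(1+x^2))
    x = (c0 * lam + c1) * lam + c2
    return 0.5 + x / (2.0 * math.sqrt(1.0 + x * x))

def ybar(lam):
    # Multi-lobe fit of the CIE ybar(lam) observer curve (same constants as
    # the sketch in the render loop section).
    def lobe(alpha, mu, s1, s2):
        s = s1 if lam < mu else s2
        return alpha * math.exp(-0.5 * ((lam - mu) / s) ** 2)
    return lobe(0.821, 568.8, 46.9, 40.5) + lobe(0.286, 530.9, 16.3, 31.1)

def intensity_scale(c0, c1, c2, target_luminance, lo=380, hi=780):
    # K = target / integral(f * ybar), trapezoidal rule with 1nm steps.
    integral = sum(0.5 * (sigmoid_spectrum(l, c0, c1, c2) * ybar(l) +
                          sigmoid_spectrum(l + 1, c0, c1, c2) * ybar(l + 1))
                   for l in range(lo, hi))
    return target_luminance / integral
</pre>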
<br />
In the demo, the original light intensity of the RGB path tracer is modified so that it better matches the intensity of the spectral rendered image. Before the modification, the RGB lighting was done by simply multiplying the light color by the light intensity. Now this value is also divided by the luminance of the color (but this loses some control in the color picker UI...).<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHjAKKsjXAXdKGa5jLUW_ufoGVvrtQxEDPfB5lxfVcUfxbQCMdC3WE-qg-C1Cx8OfbI9DJ6vDBmjzlYVsuaVZY9OBL1dsajZLd6AS1Fl3OAoL0VKXzj2qGYzvmsZDcFVF_3RJ3cMqnNB3w/s1600/intensity_ACES_without_luma.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHjAKKsjXAXdKGa5jLUW_ufoGVvrtQxEDPfB5lxfVcUfxbQCMdC3WE-qg-C1Cx8OfbI9DJ6vDBmjzlYVsuaVZY9OBL1dsajZLd6AS1Fl3OAoL0VKXzj2qGYzvmsZDcFVF_3RJ3cMqnNB3w/s200/intensity_ACES_without_luma.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">RGB light color multiply with<br />
intensity only</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfwOZsvnbJ_-LnLXZjDoR7w9YoY5lnZfq3nJTG5qwlsG1Rkc9lBQ98P_3zj4LYvehh6_SrrBqZfRIwIYdqIoBdib4zObBncJZQnFl3L2oNCIUxFYkUSpxPtIdfD4JAudjF08mZc5aIRDI4/s1600/intensity_ACES_with_luma.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfwOZsvnbJ_-LnLXZjDoR7w9YoY5lnZfq3nJTG5qwlsG1Rkc9lBQ98P_3zj4LYvehh6_SrrBqZfRIwIYdqIoBdib4zObBncJZQnFl3L2oNCIUxFYkUSpxPtIdfD4JAudjF08mZc5aIRDI4/s200/intensity_ACES_with_luma.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">RGB light color multiply with intensity<br />
divided by color luminance</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiko_M8p56U1oALSn0m8GBQqUcACGn1VubfhZOScF45mba4nkSF28nwZCSYhpmpU-1WukiMGrvejgrqChEPDB11tLeDr7Ck7sMM66eWoesOGYXKdO1X2dGr0ZeNuj5u5XDw7Fo1CINSr3I/s1600/intensity_spectral.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiko_M8p56U1oALSn0m8GBQqUcACGn1VubfhZOScF45mba4nkSF28nwZCSYhpmpU-1WukiMGrvejgrqChEPDB11tLeDr7Ck7sMM66eWoesOGYXKdO1X2dGr0ZeNuj5u5XDw7Fo1CINSr3I/s200/intensity_spectral.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Spectral light with scaled SPD curve<br />
<br /></td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
In addition to the luminance scale, we also need to chromatically adapt the light color from illuminant E to illuminant D65/D60 before computing the 3 SPD coefficients, because the coefficients are fitted using illuminant E. Without this, the image will have a reddish appearance (a small adaptation sketch follows the comparison images below). <br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0jzqhKSkeY6011Ue12K6jijRVgeePze2f6hAEU4j_qsXF6aF9VarT2M2nS4xqSFMO6FDki_e776VY1eSJTVFJ7Fn3yjfdqZFWEVfXPlEGlrcLbPFCfBGzjt44C0-_NV4_lb1uCnrHwTaz/s1600/spectral_light_without_CAT.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0jzqhKSkeY6011Ue12K6jijRVgeePze2f6hAEU4j_qsXF6aF9VarT2M2nS4xqSFMO6FDki_e776VY1eSJTVFJ7Fn3yjfdqZFWEVfXPlEGlrcLbPFCfBGzjt44C0-_NV4_lb1uCnrHwTaz/s320/spectral_light_without_CAT.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Computing light SPD coefficients without chromatic adaption</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiko_M8p56U1oALSn0m8GBQqUcACGn1VubfhZOScF45mba4nkSF28nwZCSYhpmpU-1WukiMGrvejgrqChEPDB11tLeDr7Ck7sMM66eWoesOGYXKdO1X2dGr0ZeNuj5u5XDw7Fo1CINSr3I/s1600/intensity_spectral.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiko_M8p56U1oALSn0m8GBQqUcACGn1VubfhZOScF45mba4nkSF28nwZCSYhpmpU-1WukiMGrvejgrqChEPDB11tLeDr7Ck7sMM66eWoesOGYXKdO1X2dGr0ZeNuj5u5XDw7Fo1CINSr3I/s320/intensity_spectral.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Computing light SPD coefficients with chromatic adaption</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
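For reference, a minimal sketch of such an adaptation using the Bradford transform (an assumption on my side: the demo may use a different chromatic adaptation transform, and the D60 case would substitute its own white point):<br />
<pre>import numpy as np

# Bradford cone response matrix, and the XYZ white points involved.
BRADFORD = np.array([[ 0.8951,  0.2664, -0.1614],
                     [-0.7502,  1.7135,  0.0367],
                     [ 0.0389, -0.0685,  1.0296]])
WHITE_E   = np.array([1.0, 1.0, 1.0])
WHITE_D65 = np.array([0.95047, 1.0, 1.08883])

def adaptation_matrix(src_white, dst_white):
    # Scale the cone responses by the ratio of destination to source white.
    rho_s, rho_d = BRADFORD @ src_white, BRADFORD @ dst_white
    return np.linalg.inv(BRADFORD) @ np.diag(rho_d / rho_s) @ BRADFORD

# Adapt an XYZ light color between illuminant E and D65 before fitting
# the c0, c1, c2 coefficients (example color only):
xyz_adapted = adaptation_matrix(WHITE_E, WHITE_D65) @ np.array([0.5, 0.5, 0.5])
</pre>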
<br />
<span style="font-size: large;"><b>Importance Sampling Wavelength</b></span><br />
As mentioned at the start of the post, the wavelengths are sampled using hero wavelength sampling, which randomly picks 1 wavelength within the visible spectrum (i.e. 380-780nm in the demo); 2 additional samples are then picked, evenly separated within the visible wavelength range. With this approach, there is high variance in color. Sometimes, with 100 samples per pixel, the color converges to the final color, but more often it requires > 1000 samples per pixel to converge. It really depends on luck...<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7n3JTNDFy3xxfhVjnVXEKlAJPUwkcF8c9LV-DIjzUUJ9LXSIyJR6Q4FUfi-N-xceAZwlezWqxwvpzPqVOsM9u10N7RDfjwEWKcycDIGnnQ9cOg8y9yUxwhcmPM1XvlBTrw9fD3mrdrYnO/s1600/importance_sample_hero_1.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7n3JTNDFy3xxfhVjnVXEKlAJPUwkcF8c9LV-DIjzUUJ9LXSIyJR6Q4FUfi-N-xceAZwlezWqxwvpzPqVOsM9u10N7RDfjwEWKcycDIGnnQ9cOg8y9yUxwhcmPM1XvlBTrw9fD3mrdrYnO/s200/importance_sample_hero_1.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUJhy7MF8KXTsFWi5W7764IaRRMxc3l6tgGtrPg54FnALWbDyQHg8JIpfBAwS2e62vD1eUCs85xWPk5q6Inmh7PLqhkf08hP_zK0mE-gZaQ9maqoUgGmCWdDSo5DCbh-2Un8JaWyt2Km7y/s1600/importance_sample_hero_2.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUJhy7MF8KXTsFWi5W7764IaRRMxc3l6tgGtrPg54FnALWbDyQHg8JIpfBAwS2e62vD1eUCs85xWPk5q6Inmh7PLqhkf08hP_zK0mE-gZaQ9maqoUgGmCWdDSo5DCbh-2Un8JaWyt2Km7y/s200/importance_sample_hero_2.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv1sIstNDTfaG3S6t4qCmknrS5iHDDLSQYo22Tu3r8LIHet6DKmLktC5zIg1FHYGBIvgiMErTetfc828IpP4E-Edf0lnbZwwq7DKXowgG9PiQAmNeI04CQFDDPoGRqj9I1RMOOXSXZ1RtC/s1600/importance_sample_hero_3.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv1sIstNDTfaG3S6t4qCmknrS5iHDDLSQYo22Tu3r8LIHet6DKmLktC5zIg1FHYGBIvgiMErTetfc828IpP4E-Edf0lnbZwwq7DKXowgG9PiQAmNeI04CQFDDPoGRqj9I1RMOOXSXZ1RtC/s200/importance_sample_hero_3.png" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><span style="font-size: x-small;">3 different spectral rendered images with hero wavelength using 100 samples per pixel,</span><br />
<span style="font-size: x-small;">The color looks a bit different between all 3 images with a noticeable red tint in the middle image.</span></td>
</tr>
</tbody></table>
<br />
To make the render converge faster, let's consider the CIE XYZ standard observer curves below: samples with wavelength >650nm and <420nm have only a small influence on the output image. So I tried to place more samples around the center of the visible wavelength range.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2_DEo4xn8J0K6ONf5isMe4IDyu-Dzoigqj7HMlQ9piOZEyLMsf43v3R59TmlbXq0mSKKUTxdyk8FKlDJnmH_Z89EA_y9lr4FVUvvprALpXFcm77OCCqdrUMhnpRyP4M2iLgquCDf4FrGz/s1600/CIE_XYZ.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2_DEo4xn8J0K6ONf5isMe4IDyu-Dzoigqj7HMlQ9piOZEyLMsf43v3R59TmlbXq0mSKKUTxdyk8FKlDJnmH_Z89EA_y9lr4FVUvvprALpXFcm77OCCqdrUMhnpRyP4M2iLgquCDf4FrGz/s400/CIE_XYZ.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">CIE 1931 Color Matching Function from Wikipedia</td></tr>
</tbody></table>
My first (failed) attempt was to use a cos-weighted PDF curve like this to randomly pick 3 wavelengths for each ray:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVVoeV2zRsaSom2s1A28ZmyGWKiqfpnfCwFr3puW7tpGU7P7TkM-pIkUUneF3AbcXh1qHyeYEYFTaUaWiT0CR2byZ1xr78QJSePTk_mhJsXxfes695lauB33gbh7Ay_Kif1heq-YMEnHoJ/s1600/cos_graph.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVVoeV2zRsaSom2s1A28ZmyGWKiqfpnfCwFr3puW7tpGU7P7TkM-pIkUUneF3AbcXh1qHyeYEYFTaUaWiT0CR2byZ1xr78QJSePTk_mhJsXxfes695lauB33gbh7Ay_Kif1heq-YMEnHoJ/s320/cos_graph.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"></td></tr>
</tbody></table>
A normalization constant is computed so that the PDF integrates to one, and then the CDF can be computed. To pick a random sample from this PDF, the <a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling">inverse method</a> can be used. To simplify the calculation, the PDF is centered at 0 with width 200 instead of the [380, 780] range. After sampling λ from the inverse CDF, λ is shifted by 580 to make it lie in the [380, 780] range. To find the inverse of the CDF:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkmPWsJ66jAQrj4mFpM0zBLuD9Y5lsm8DYMLuDy4jEtx7ZM3MzXG2HTd862z_MchGRpltE4Sc5mDbV7X89d9IH2aI5bRbDoqrv_Ry76EZTZdUInauhvC0YvhsPeyo7g7gKoUyG_a_Z3l38/s1600/cos_pdf_cdf.png" imageanchor="1"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkmPWsJ66jAQrj4mFpM0zBLuD9Y5lsm8DYMLuDy4jEtx7ZM3MzXG2HTd862z_MchGRpltE4Sc5mDbV7X89d9IH2aI5bRbDoqrv_Ry76EZTZdUInauhvC0YvhsPeyo7g7gKoUyG_a_Z3l38/s400/cos_pdf_cdf.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Compute inverse CDF of the cos weighted PDF (with w=200)</td></tr>
</tbody></table>
<br />
Unfortunately, this cannot be inverted analytically, as <a href="https://www.quora.com/How-do-you-solve-for-x-in-y-x+sin-x">mentioned here</a>. So <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton's method</a> (with 15 iterations) is used, as suggested in <a href="https://www.quora.com/How-do-you-solve-for-x-in-y-x+sin-x">this post</a>. A small sketch of the sampling routine follows, with the rendered results after it:<br />
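(A sketch in Python, using the w=200 convention above; the initial guess and the guard against the zero-density endpoints are my own choices, not necessarily the demo's.)<br />
<pre>import math

def sample_cos_pdf(u, w=200.0, iters=15):
    # PDF p(x) = (1 + cos(pi*x/w)) / (2w) on [-w, w]; its CDF
    # F(x) = (x + w)/(2w) + sin(pi*x/w)/(2*pi) has no closed-form inverse,
    # so solve F(x) = u with Newton's method: x -= (F(x) - u) / p(x).
    x = (2.0 * u - 1.0) * w                  # start from the uniform inverse
    for _ in range(iters):
        F = (x + w) / (2.0 * w) + math.sin(math.pi * x / w) / (2.0 * math.pi)
        p = (1.0 + math.cos(math.pi * x / w)) / (2.0 * w)
        x -= (F - u) / max(p, 1e-7)
        x = min(max(x, -w), w)               # keep the iterate in the support
    return x + 580.0                         # shift into [380, 780] nm
</pre>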
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1nnGosc-mAncf_T0RapgfkRzbqakNUCRgtQJcDf2CWeWALInTA1JXSw6S8SPmvanXttMz0Lna67mHJDWe7e4w0O-JujZqmnnRRgn7sRGN25NN-W2PPM0Bw-NuYoHQiv8BXaAKv9FpdIgW/s1600/importance_sample_cos_1.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1nnGosc-mAncf_T0RapgfkRzbqakNUCRgtQJcDf2CWeWALInTA1JXSw6S8SPmvanXttMz0Lna67mHJDWe7e4w0O-JujZqmnnRRgn7sRGN25NN-W2PPM0Bw-NuYoHQiv8BXaAKv9FpdIgW/s200/importance_sample_cos_1.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA4Uy7-0xna3v85aq6xW6cj9KMsRT3aSYi7LKhfARdhaKxY0B6Fdel6Z6xtHBv9a7JWSKfxM59yqAFkAyejmM-iryyccJkkcogIXeydfT17evH5wGshuoQlZt_cIwfZetd6ywnbqoZ-H6P/s1600/importance_sample_cos_2.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA4Uy7-0xna3v85aq6xW6cj9KMsRT3aSYi7LKhfARdhaKxY0B6Fdel6Z6xtHBv9a7JWSKfxM59yqAFkAyejmM-iryyccJkkcogIXeydfT17evH5wGshuoQlZt_cIwfZetd6ywnbqoZ-H6P/s200/importance_sample_cos_2.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqM8aAO1GWhsvu0R6sYLJFJPEy0muo7VtqSwkTYrOXTiPhe8bj1gIltEluRVtat5ZNXqi2ZInbTREtX9VN9Nfa0q_nUzGULq5rkyGsb-FbUVx7k9OS79aSEH2CabyY9bfhigIilgAF3y-t/s1600/importance_sample_cos_3.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqM8aAO1GWhsvu0R6sYLJFJPEy0muo7VtqSwkTYrOXTiPhe8bj1gIltEluRVtat5ZNXqi2ZInbTREtX9VN9Nfa0q_nUzGULq5rkyGsb-FbUVx7k9OS79aSEH2CabyY9bfhigIilgAF3y-t/s200/importance_sample_cos_3.png" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><span style="font-size: x-small;">3 different spectral rendered images with cos-weighted PDF using 100 samples per pixel,</span><br />
<span style="font-size: x-small;">The color still looks a bit different between all 3 images...
</span></td>
</tr>
</tbody></table>
<span class="q-box qu-userSelect--text" style="box-sizing: border-box; direction: ltr;"><span style="font-style: normal; font-weight: normal;"><br /></span></span>
<span class="q-box qu-userSelect--text" style="box-sizing: border-box; direction: ltr;"><span style="font-style: normal; font-weight: normal;">Sadly, the result is not </span></span>improved, which gives more color variance than the hero wavelength method...<br />
<span class="q-box qu-userSelect--text" style="box-sizing: border-box; direction: ltr;"><span style="font-style: normal; font-weight: normal;"><br /></span></span>
<span class="q-box qu-userSelect--text" style="box-sizing: border-box; direction: ltr;"><span style="font-style: normal; font-weight: normal;">So I google for a while and found another paper: <a href="https://www.researchgate.net/publication/228938842_An_Improved_Technique_for_Full_Spectral_Rendering">"An Improved Technique for Full Spectral Rendering"</a>. It suggests to use the cosh function for PDF, which its CDF can be inverted analytically: </span></span><br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyqwnCS9cHeVbycKwco3q9PTm71OmANgd8pB3pdPs7qWytEzhBoa4gdW9ISR9e4J50Uot2cCa6Ia2jbyEiRGpoNdV3BTKgYsfhaU3kQqq9s-b2WIurGY8wbUYAMnTrFi6BM_ZE-523ld1X/s1600/cosh_pdf.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="65" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyqwnCS9cHeVbycKwco3q9PTm71OmANgd8pB3pdPs7qWytEzhBoa4gdW9ISR9e4J50Uot2cCa6Ia2jbyEiRGpoNdV3BTKgYsfhaU3kQqq9s-b2WIurGY8wbUYAMnTrFi6BM_ZE-523ld1X/s320/cosh_pdf.png" width="320" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyJ9DICNxACXeI3MBnvtpyzgbRhXuzttgaFTn-UYcYiUwPr_XNhCkD-uQZy8HTiMy5bAKPWQDnWF2qVotUGBwfzygcm5mPkT-wPKyP1w7lEvAmNb5rwFhJ5vTcWsaJ4HgSEve1UtNTVWWF/s1600/cosh_graph.png" imageanchor="1"><img border="0" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyJ9DICNxACXeI3MBnvtpyzgbRhXuzttgaFTn-UYcYiUwPr_XNhCkD-uQZy8HTiMy5bAKPWQDnWF2qVotUGBwfzygcm5mPkT-wPKyP1w7lEvAmNb5rwFhJ5vTcWsaJ4HgSEve1UtNTVWWF/s320/cosh_graph.png" width="320" /></a>
</td>
</tr>
</tbody></table>
The paper only suggests using that PDF curve with center B = 538nm and A = 0.0072. Since this shape is similar to my cos-weighted PDF, the color convergence rate is similar (so I just skipped capturing screenshots for this case)... But what if we use 3 of these curves with their centers lying around the peaks of the XYZ standard observer curves? To find out, I computed the normalization constant within the [380, 780]nm range, and then the CDF and inverse CDF:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaBKY8xh4JgxiZpgUSibpfhdLP9Lr_JnYTTcvByPP4vCnL0KBRa06YSsoJsfHYoMxBFvIHYcZWsbg22HTUJrHaY0MD-20cph6Tn3oMO-1t2AMeW59qeGKdlK0oq_ATaCccRSmPgmur-yhn/s1600/cosh_pdf_cdf.png" imageanchor="1"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaBKY8xh4JgxiZpgUSibpfhdLP9Lr_JnYTTcvByPP4vCnL0KBRa06YSsoJsfHYoMxBFvIHYcZWsbg22HTUJrHaY0MD-20cph6Tn3oMO-1t2AMeW59qeGKdlK0oq_ATaCccRSmPgmur-yhn/s640/cosh_pdf_cdf.png" width="640" /></a><br />
<br />
By using 3 different PDFs to sample the wavelengths (<span style="font-size: x-small;">A<span style="font-size: xx-small;">0</span>=0.0078, B<span style="font-size: xx-small;">0</span>= 600, A<span style="font-size: xx-small;">1</span>= 0.0072, B<span style="font-size: xx-small;">1</span>= 535, A<span style="font-size: xx-small;">2</span>= 0.0062, B<span style="font-size: xx-small;">2</span>= 445 are used in the demo</span>), the image converges much faster than with hero wavelength sampling. About 100 SPP is often enough to get a color similar to the converged image. <br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4ejefdmIRzmd9wt9T_zlhitJLh2zUH0QrrKwcVKU-xzRqlhPDCrayQSTW6LPp__Y95ONROTk4qng6mWImcfonA1c9-fYxxZ0-ulkWvR1cCMI7eVfCYVmug4eK26WkqomblaERHeBR2whM/s1600/importance_sample_cosh_XYZ.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4ejefdmIRzmd9wt9T_zlhitJLh2zUH0QrrKwcVKU-xzRqlhPDCrayQSTW6LPp__Y95ONROTk4qng6mWImcfonA1c9-fYxxZ0-ulkWvR1cCMI7eVfCYVmug4eK26WkqomblaERHeBR2whM/s320/importance_sample_cosh_XYZ.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rendered with 3 different cosh-curves PDF using 100SPP</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinoDZ7tr3sCiPLsR9mIpI7FTMHRMrW_9xTsTG1UL5yyuYDn4Y6Z2Oa-Y-AX3OJ0cP798l47-A-uqV9wNf6l2cf7cWNM-siKDVvWpCGa4NaCf_z9UT_pcKb-EE7t9Cxq-Nsj0mV9k962XyR/s1600/importance_sample_cosh_XYZ_converged.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinoDZ7tr3sCiPLsR9mIpI7FTMHRMrW_9xTsTG1UL5yyuYDn4Y6Z2Oa-Y-AX3OJ0cP798l47-A-uqV9wNf6l2cf7cWNM-siKDVvWpCGa4NaCf_z9UT_pcKb-EE7t9Cxq-Nsj0mV9k962XyR/s320/importance_sample_cosh_XYZ_converged.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Converged spectral rendered image.</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
Another problem with the color variance of hero wavelength sampling is camera movement. Since my demo is an interactive path tracer, when the camera moves, the path tracer re-generates the wavelength samples, which changes the color greatly every frame:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZPePANwvl_CMKTeBDcCggo5SMe2YxaZX_PA3TFxq1BnWlURowZp2ToGZVZny8AGubeN5rdGFsSTYXAU-jvF-DB8vvdxtiOGhvkwv6dfNOcygnVno6Wq6QP7jMiomy5HWSfqhLNTDz6IY7/s1600/camera_move_hero.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="854" height="223" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZPePANwvl_CMKTeBDcCggo5SMe2YxaZX_PA3TFxq1BnWlURowZp2ToGZVZny8AGubeN5rdGFsSTYXAU-jvF-DB8vvdxtiOGhvkwv6dfNOcygnVno6Wq6QP7jMiomy5HWSfqhLNTDz6IY7/s400/camera_move_hero.gif" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Camera movement with hero wavelength sampling</td></tr>
</tbody></table>
<br />
To give a better color preview during the first few samples of the path tracing, the random numbers are stratified into 9 regions so that the first ray picks 3 random wavelengths lying around 600nm, 535nm and 445nm when substituted into the inverse CDFs of the cosh-weighted curves, which gives some red, green and blue color.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ6Ud56ouDse77NJ7eqwNUW_a0tbxRZE9ZlixwFVzWHX0647W_bq7v45VSAYvyvUTcJj0IjktbV8Nu6uAywqqrJo7DYOsFC69esXX4JYM4nf3ZygsQGA4o8a6UqyLpvs3LIsdgOS_8gD2H/s1600/stratifiedRand.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ6Ud56ouDse77NJ7eqwNUW_a0tbxRZE9ZlixwFVzWHX0647W_bq7v45VSAYvyvUTcJj0IjktbV8Nu6uAywqqrJo7DYOsFC69esXX4JYM4nf3ZygsQGA4o8a6UqyLpvs3LIsdgOS_8gD2H/s640/stratifiedRand.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Code to generate stratified random numbers P0, P1, P2 within [0, 1] range.</td></tr>
</tbody></table>
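A rough Python equivalent of the idea in the screenshot above (the stratum visit order here is my own assumption, not necessarily the demo's exact code):<br />
<pre>import random

def stratified_p3(sample_index):
    # Split [0, 1] into 9 strata and visit them middle-out, so sample 0
    # draws P0, P1, P2 near 0.5; the inverse CDFs of the 3 cosh curves then
    # map them to wavelengths near 600nm, 535nm and 445nm (some R, G, B).
    order = [4, 3, 5, 2, 6, 1, 7, 0, 8]
    stratum = order[sample_index % 9]
    return [(stratum + random.random()) / 9.0 for _ in range(3)]
</pre>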
<br />
With these stratified random numbers, color variation is reduced during camera movement:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2XxpHBHHYSZiF4XknAD-qh9hKUlDm92gXJKj6qWQ-K_JVDjYFQ2OAucCRCZvXnf1rfUYohBvnlG4xiXTzTOnUBUBzdP1lqlmCOh88NNM8INbOLE7ruXkhqMGfLVUEAXmsJaXoTwzrqgPr/s1600/camera_move_cosh.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="854" height="223" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2XxpHBHHYSZiF4XknAD-qh9hKUlDm92gXJKj6qWQ-K_JVDjYFQ2OAucCRCZvXnf1rfUYohBvnlG4xiXTzTOnUBUBzdP1lqlmCOh88NNM8INbOLE7ruXkhqMGfLVUEAXmsJaXoTwzrqgPr/s400/camera_move_cosh.gif" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Camera movement with stratified random numbers.</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have described how a basic spectral path tracer can be implemented. The spectral rendered image is a bit different from the RGB rendered image (the RGB rendered image is a bit more reddish than the spectral traced one). This may be due to the spectral up-sampling method used, or to not using a D65 light source. However, the bounced light intensity is not much different between tracing in spectral and ACEScg space. In the future I would like to try different light sources such as illuminant E/D/F to see how they affect the color. I would also like to have a technique to spectrally up-sample albedo in a wide color gamut instead of sRGB only.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://cgg.mff.cuni.cz/~wilkie/Website/EGSR_14_files/WNDWH14HWSS.pdf">https://cgg.mff.cuni.cz/~wilkie/Website/EGSR_14_files/WNDWH14HWSS.pdf</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space">https://en.wikipedia.org/wiki/CIE_1931_color_space</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://graphics.geometrian.com/research/spectral-primaries.html">https://graphics.geometrian.com/research/spectral-primaries.html</a></span><br />
<span style="font-size: x-small;">[4] <a href="https://rgl.s3.eu-central-1.amazonaws.com/media/papers/Jakob2019Spectral_3.pdf">https://rgl.s3.eu-central-1.amazonaws.com/media/papers/Jakob2019Spectral_3.pdf</a></span><br />
<span style="font-size: x-small;">[5] <a href="http://jcgt.org/published/0002/02/01/paper.pdf">http://jcgt.org/published/0002/02/01/paper.pdf</a> </span><br />
<span style="font-size: x-small;">[6] <a href="https://www.researchgate.net/publication/228938842_An_Improved_Technique_for_Full_Spectral_Rendering">https://www.researchgate.net/publication/228938842_An_Improved_Technique_for_Full_Spectral_Rendering</a> </span><br />
<br />
<br />
<br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-42695078911315570362020-04-22T03:30:00.001+08:002020-04-22T03:30:58.722+08:00HDR Display<span style="font-size: large;"><b>Introduction</b></span><br />
Continuing with the DXR Path Tracer from the last post, I updated the demo to support HDR display. It also has various small features added, such as per-monitor high DPI support, a path trace resolution scale (e.g. path trace at 1080p and bilinearly upscale to 4K) and dithering of the tone mapped output to reduce color banding (integer back buffer format only). The updated demo can be downloaded <a href="https://drive.google.com/file/d/1v6uxc9Yyz-HmorJTHouZkxOQLMObjoAI/view?usp=sharing"><b>here</b></a>.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn5GIICqvt2EPPcZ-KrluoyfFNcNo2WXX7S5QymVrza_6ifA5pcia6lFtUD6sy3J3sGQp-769BRvWPrbXGrvPsbaNlV9kokaK_6QWsFx4qeUV3mh_yj92NfQIg7-H1LFUVtEFPKwsu0UTa/s1600/SDR_HDR_1.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn5GIICqvt2EPPcZ-KrluoyfFNcNo2WXX7S5QymVrza_6ifA5pcia6lFtUD6sy3J3sGQp-769BRvWPrbXGrvPsbaNlV9kokaK_6QWsFx4qeUV3mh_yj92NfQIg7-H1LFUVtEFPKwsu0UTa/s640/SDR_HDR_1.JPG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A photo of my HDR TV to show the difference between SDR and HDR</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><b>HDR Color Spaces</b></span><br />
There are 2 swapchain format/color space combinations that can be chosen to output HDR images on an HDR capable monitor/TV (a setup sketch follows the list):<br />
<blockquote class="tr_bq">
1. Rec 2020 color space<span style="font-size: x-small;"> (DXGI_FORMAT_R10G10B10A2_UNORM + DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020)</span><br />
2. scRGB color space <span style="font-size: x-small;">(DXGI_FORMAT_R16G16B16A16_FLOAT + DXGI_COLOR_SPACE_RGB_FULL_G10_NONE_P709)</span></blockquote>
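As a minimal sketch (not the demo's exact code), selecting one of these combinations looks roughly like this, assuming the swapchain was already created with the matching back buffer format:
<pre>
#include &lt;dxgi1_6.h&gt;

// Pick one of the two pairs above: R16G16B16A16_FLOAT for scRGB,
// R10G10B10A2_UNORM for HDR10 (error handling trimmed).
void SetHdrColorSpace(IDXGISwapChain4* swapChain, bool useScRGB)
{
    DXGI_COLOR_SPACE_TYPE colorSpace = useScRGB
        ? DXGI_COLOR_SPACE_RGB_FULL_G10_NONE_P709       // linear scRGB
        : DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020;   // HDR10 / PQ

    // Verify the display path supports the color space before setting it.
    UINT support = 0;
    if (SUCCEEDED(swapChain-&gt;CheckColorSpaceSupport(colorSpace, &amp;support)) &amp;&amp;
        (support &amp; DXGI_SWAP_CHAIN_COLOR_SPACE_SUPPORT_FLAG_PRESENT))
    {
        swapChain-&gt;SetColorSpace1(colorSpace);
    }
}
</pre>
<br />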
Rec2020 color space is the common <a href="https://en.wikipedia.org/wiki/High-dynamic-range_video#HDR10">HDR10 format with PQ EOTF</a>. But it is recommended to <a href="https://youtu.be/OvLuQliiJlg?t=1752">use scRGB color space on Windows</a>. scRGB uses the same color primaries as Rec709, and supports wide color using negative values. I was confused about using negative values to represent a color (as well as an intensity) at first. Because a color gamut is usually displayed as a <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#CIE_xy_chromaticity_diagram_and_the_CIE_xyY_color_space">CIE xy chromaticity diagram</a>, I used to think of a given RGB value as a color interpolated inside the Red/Green/Blue gamut triangle using barycentric coordinates.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN42uI80Zvj8txadM2Gd4ue0_bdRhEucEaYEKHIKm3SkCxMES48id1qEXYVnJcDNOxj9N5WvvFmJZOJqvt2-Jtv3ZUXG9agziiHQ5S7n6RaWj6a2gpuQ_PWRt4mKG7H10wkOerDAbOT7SV/s1600/gamut_test.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN42uI80Zvj8txadM2Gd4ue0_bdRhEucEaYEKHIKm3SkCxMES48id1qEXYVnJcDNOxj9N5WvvFmJZOJqvt2-Jtv3ZUXG9agziiHQ5S7n6RaWj6a2gpuQ_PWRt4mKG7H10wkOerDAbOT7SV/s320/gamut_test.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A debug chromaticity diagram showing Rec2020 color gamut. <br />
Using a HDR backbuffer will show less clipped color</td></tr>
</tbody></table>
<span class="st">Although, it makes sense to represent a wide color using negative numbers using </span><span class="st">barycentric </span>interpolation, I was confused how it to represent the intensity at the same time. It is because the chromaticity diagram skipped the luminance information. So, instead of thinking color inside the horseshoe-shaped diagram, it is easier for me to think about the color in 3D <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space">XYZ color space</a>. The <span class="st">Red/Green/Blue color primaries of a gamut is 3 basis vectors in the XYZ color space. A </span>linear combination of RGB values with the color primaries basis vectors can represent a color and intensity. Thinking in this way make me feel more comfortable. So the path traced lighting values are fed into <a href="https://github.com/ampas/aces-dev/tree/master/transforms/ctl/outputTransforms">ACES HDR tone mapper</a>, transformed into Rec709 color space, and then divided by 80 when using scRGB color space(scRGB requires value of 1 to represent 80 nit).<br />
<br />
<span style="font-size: large;"><b>HDR10 metadata</b></span><br />
I have also played around with <a href="https://docs.microsoft.com/en-us/windows/win32/api/dxgi1_5/ns-dxgi1_5-dxgi_hdr_metadata_hdr10">HDR10 metadata</a> to see how it affects the image. But most of the data does not affect the image on my <a href="https://www.samsung.com/hk_en/tvs/puhd-mu7300/UA49MU7300JXZK/">Samsung UA49MU7300JXKZ</a> TV. The only data that has an effect on the image is "Max Mastering Luminance", which can affect the image brightness: setting it to a small value will make the image darker despite outputting a very bright 1000 nit image. Also, the HDR10 metadata only works in Full Screen Exclusive mode, or in borderless windowed mode with the <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-setwindowpos">HWND_TOPMOST Z order</a> (I guess the <a href="https://devblogs.microsoft.com/directx/demystifying-full-screen-optimizations/">full screen optimization</a> gets enabled); using a borderless window with the HWND_TOP Z order won't work (but this mode is easier to alt-tab on a single monitor setup...). Besides, entering Full Screen Exclusive mode may fail when calling <a href="https://docs.microsoft.com/en-us/windows/win32/api/dxgi/nf-dxgi-idxgiswapchain-setfullscreenstate">SetFullscreenState()</a> if the display is not connected to the adapter used for rendering. I didn't notice this until I started to work on a laptop which uses the RTX graphics card for ray tracing while the laptop monitor is connected to the Intel graphics card. Looks like some hard work needs to be done to support Full Screen Exclusive mode properly (e.g. create a D3D device/command queue/swapchain for the Intel graphics card and copy the ray traced image from the RTX card to the Intel card for the full screen swapchain). Unfortunately, my demo does not support multi-adapter, so the HDR10 metadata may not work on such a setup (I am outputting to the external HDR TV using the RTX graphics card, so it doesn't create much of a problem for me...). A sketch of setting the metadata follows below.<br />
<br />
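Below is a sketch of filling and setting the metadata; the mastering values are illustrative, and the unit scaling follows Microsoft's D3D12HDR sample (chromaticities in units of 0.00002, mastering luminances in units of 0.0001 nit):
<pre>
#include &lt;dxgi1_6.h&gt;

void SetHdr10Metadata(IDXGISwapChain4* swapChain)
{
    DXGI_HDR_METADATA_HDR10 meta = {};
    meta.RedPrimary[0]   = UINT16(0.708f  * 50000.0f);  // Rec2020 red
    meta.RedPrimary[1]   = UINT16(0.292f  * 50000.0f);
    meta.GreenPrimary[0] = UINT16(0.170f  * 50000.0f);  // Rec2020 green
    meta.GreenPrimary[1] = UINT16(0.797f  * 50000.0f);
    meta.BluePrimary[0]  = UINT16(0.131f  * 50000.0f);  // Rec2020 blue
    meta.BluePrimary[1]  = UINT16(0.046f  * 50000.0f);
    meta.WhitePoint[0]   = UINT16(0.3127f * 50000.0f);  // D65 white point
    meta.WhitePoint[1]   = UINT16(0.3290f * 50000.0f);
    meta.MaxMasteringLuminance     = UINT(1000.0f * 10000.0f); // 1000 nits
    meta.MinMasteringLuminance     = UINT(0.001f  * 10000.0f); // 0.001 nit
    meta.MaxContentLightLevel      = 1000;  // nits
    meta.MaxFrameAverageLightLevel = 200;   // nits
    swapChain-&gt;SetHDRMetaData(DXGI_HDR_METADATA_TYPE_HDR10,
                              sizeof(meta), &amp;meta);
}
</pre>
<br />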
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVC2P8g3tl6VT7vcgMeK7i9vAtpaB5y1V0klltHmLqBDBuLrLuvZJvsiEZGpKo3gomhhKRxQxqT3HKfxSh_wZo8poAr0vo5rU5kLyJb8hk4R5fkxi-BDQ2h4Jtux9-WD5ef34jjZP27uG5/s1600/TV_setting.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVC2P8g3tl6VT7vcgMeK7i9vAtpaB5y1V0klltHmLqBDBuLrLuvZJvsiEZGpKo3gomhhKRxQxqT3HKfxSh_wZo8poAr0vo5rU5kLyJb8hk4R5fkxi-BDQ2h4Jtux9-WD5ef34jjZP27uG5/s320/TV_setting.JPG" width="283" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Capability of my HDR TV</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIfnIfWZUzzotgHO0oCiCueZlBR5hEWORZS3StU3ZLR5h1A1D7Ckn8yWcFuRJJGVOHydv-QLHuiQJQrsxKyLaOjjjNClnMQ3wjwoUXINfRiMhMm5ZFL7aUBsRX4Te2lQfErxsAo6ttjja8/s1600/HDR10metadata.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIfnIfWZUzzotgHO0oCiCueZlBR5hEWORZS3StU3ZLR5h1A1D7Ckn8yWcFuRJJGVOHydv-QLHuiQJQrsxKyLaOjjjNClnMQ3wjwoUXINfRiMhMm5ZFL7aUBsRX4Te2lQfErxsAo6ttjja8/s320/HDR10metadata.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">UI showing all the adjustable HDR10 meta in the demo</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span style="font-size: large;">UI</span></b><br />
Blending the SDR UI with the HDR image is handled in 2 steps: blend the color, then the brightness. All the UI is rendered into an off-screen buffer (in 8-bit Rec709 color space) and later blended with the ACES tone mapped image. Looking at the <a href="https://github.com/ampas/aces-dev/blob/master/transforms/ctl/lib/ACESlib.OutputTransforms.ctl">ACES tone mapping function snippet below</a>, the lighting value is mapped to a normalized range in AP1 color space as an intermediate step (red part in the code snippet). So in the demo, the UI in the off-screen buffer is converted to AP1 color space and blended with the tone mapped image at this step; a sketch of this color blend follows the snippet. (I have also tried blending in XYZ space and the result is similar in the demo.)<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbQQGv4knlWBzi0rGl6TavsPiu8Owv_Qb9bK5za5kqR-VBqOKQv08wSg_FRjIg859bh8sFi-4_YEZCHuNV43iV31_GBF_iHsWQFnlmxo26kr3bmT8cInNCVbPthca9kHeOggWD2ZSh75OB/s1600/HDR_tonemap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbQQGv4knlWBzi0rGl6TavsPiu8Owv_Qb9bK5za5kqR-VBqOKQv08wSg_FRjIg859bh8sFi-4_YEZCHuNV43iV31_GBF_iHsWQFnlmxo26kr3bmT8cInNCVbPthca9kHeOggWD2ZSh75OB/s400/HDR_tonemap.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ACES tone map function snippet</td></tr>
</tbody></table>
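A sketch of the color blend step (the Rec709-to-AP1 matrix values are the commonly quoted Bradford-adapted ones, and the helper names are mine, not the demo's):
<pre>
#include &lt;cmath&gt;

struct float3 { float x, y, z; };

// Inverse sRGB EOTF: decode the 8-bit UI buffer to linear Rec709.
static float SRGBToLinear(float c)
{
    return (c &lt;= 0.04045f) ? c / 12.92f
                           : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Linear Rec709 -&gt; ACEScg(AP1), Bradford-adapted D65 -&gt; D60
// (commonly quoted values, approximate).
static float3 Rec709ToAP1(float3 c)
{
    return { 0.6131f * c.x + 0.3395f * c.y + 0.0474f * c.z,
             0.0702f * c.x + 0.9164f * c.y + 0.0134f * c.z,
             0.0206f * c.x + 0.1096f * c.y + 0.8698f * c.z };
}

// Alpha blend the UI over the normalized tone mapped AP1 value
// (the red part of the snippet above).
float3 BlendUIOverAP1(float3 toneMappedAP1, float3 uiSRGB, float uiAlpha)
{
    float3 ui = Rec709ToAP1({ SRGBToLinear(uiSRGB.x),
                              SRGBToLinear(uiSRGB.y),
                              SRGBToLinear(uiSRGB.z) });
    return { ui.x * uiAlpha + toneMappedAP1.x * (1.0f - uiAlpha),
             ui.y * uiAlpha + toneMappedAP1.y * (1.0f - uiAlpha),
             ui.z * uiAlpha + toneMappedAP1.z * (1.0f - uiAlpha) };
}
</pre>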
<br />
Then the UI blended image can be transformed into the target color space (e.g. scRGB or Rec2020, purple part in the above code snippet). When converting the normalized color data to HDR data, the ACES tone mapper interpolates the RGB values between Y_MIN and Y_MAX (i.e. blue part in the above code snippet). During this brightness interpolation, the demo adjusts Y_MAX (e.g. 1000 to 4000 nits) toward the user defined UI brightness (e.g. 80 to 300 nits) depending on the UI alpha, using the following formula:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBq7ivzA6LvEFJbwoqu6nI59TqkuFzT72I5RdGz6TA2oE4leITiJ1zBimDlQ9_B6k6_067-EadcE-tAYl9VFDTvTC5IROrLFZSGSb7Gu01FldOfnToOieECmSnor4oATokhQ4W_RXkFyUG/s1600/Y_max.png" imageanchor="1"><img border="0" height="40" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBq7ivzA6LvEFJbwoqu6nI59TqkuFzT72I5RdGz6TA2oE4leITiJ1zBimDlQ9_B6k6_067-EadcE-tAYl9VFDTvTC5IROrLFZSGSb7Gu01FldOfnToOieECmSnor4oATokhQ4W_RXkFyUG/s400/Y_max.png" width="400" /></a><br />
<br />
In the demo, BlendPow defaults to 5. Although the result is not perfect (e.g. the UI alpha may not look like it blends linearly, depending on the background luminance), it works well enough to avoid bleed-through from a bright background when the UI alpha &gt; 0.5:<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5amV69Bi7IXyiBPyDU8ndWaH35nZ-Hy9vfusBtGIiX3GOed-PRQ6fmXIZGJocEQUes8LuAp5HFYHeSIa3jN6Kc6y9bWBuyltLTjVOWb_XuftD06pKl39MVVpq4SbYh0eXVWQUYXrBTl9O/s1600/UI_blend_dark.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="111" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5amV69Bi7IXyiBPyDU8ndWaH35nZ-Hy9vfusBtGIiX3GOed-PRQ6fmXIZGJocEQUes8LuAp5HFYHeSIa3jN6Kc6y9bWBuyltLTjVOWb_XuftD06pKl39MVVpq4SbYh0eXVWQUYXrBTl9O/s200/UI_blend_dark.JPG" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Background Luminance: 0-10 nit</td></tr>
</tbody></table>
</td><td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhJBnrInMLXJ0Fx6JYCUSLxvQ-F3E3cLFdPhvBO-iESAJ11QDRSiYE1338xplOTxUG76Q2d5Vjy9_peZC21vyKxuItpWwRIH1_5WHvDKIdXYNNpoMLwCxjteNblC6tGl2ZJOjgX8Dy0EK2/s1600/UI_blend_normal.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhJBnrInMLXJ0Fx6JYCUSLxvQ-F3E3cLFdPhvBO-iESAJ11QDRSiYE1338xplOTxUG76Q2d5Vjy9_peZC21vyKxuItpWwRIH1_5WHvDKIdXYNNpoMLwCxjteNblC6tGl2ZJOjgX8Dy0EK2/s200/UI_blend_normal.JPG" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Background Luminance: 2 - 250 nit</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsWS1sRxX9JXTvnDqfHzNbL-Mhnujpu2pmwkVgWKav-72thnOZ53zOYIGtC19mn_rkXknKOTH05n2g7bwrX4B2KqysIB-BbT_s3akOd35_D9_SWeYOQ75vK-_i6Ibn9goXGbSirxPUZ-J6/s1600/UI_blend_bright.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsWS1sRxX9JXTvnDqfHzNbL-Mhnujpu2pmwkVgWKav-72thnOZ53zOYIGtC19mn_rkXknKOTH05n2g7bwrX4B2KqysIB-BbT_s3akOd35_D9_SWeYOQ75vK-_i6Ibn9goXGbSirxPUZ-J6/s200/UI_blend_bright.JPG" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Background Luminance: 1000 nit</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td align="center" colspan="3"><span style="font-size: x-small;">Photos showing UI blending with HDR background, from dark to bright.</span></td>
</tr>
</tbody></table>
<br />
However, the above blending formula has an artifact when Y_MAX is smaller than the UI brightness (this may not happen in practice and only shows up in some debug view modes). In this case, the background may look too bright after blending with the UI. To fix this, inverting the BlendPow exponent helps to minimize the artifact:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYJTzJXGlWP1MkHyXvUv-ibO-MNIyKeR9q_-lc9ubKfVgmCzP7aou8MuJBjcpXA4ITDYds3v17s3RJluoJwYIfTe3FXgXANZXuvABUfGCI4NkWrKxB5CrZtMj1ZFY7-W8HmMhWWVXNgpS-/s1600/Y_max2.png" imageanchor="1"><img border="0" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYJTzJXGlWP1MkHyXvUv-ibO-MNIyKeR9q_-lc9ubKfVgmCzP7aou8MuJBjcpXA4ITDYds3v17s3RJluoJwYIfTe3FXgXANZXuvABUfGCI4NkWrKxB5CrZtMj1ZFY7-W8HmMhWWVXNgpS-/s400/Y_max2.png" width="400" /></a><br />
<br />
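Since the exact expressions only live in the demo, the following is just a hypothetical reconstruction of the two formulas pictured above, showing the general shape of the Y_MAX adjustment and the inverted-exponent fix:
<pre>
#include &lt;cmath&gt;

// Hypothetical reconstruction. Y_max: tone mapper peak (e.g. 1000-4000
// nits). uiNits: user defined UI brightness (e.g. 80-300 nits).
// blendPow: 5 by default. uiAlpha in [0, 1].
float AdjustYMax(float Y_max, float uiNits, float uiAlpha, float blendPow)
{
    // Invert the exponent when the peak is below the UI brightness, so a
    // mostly transparent UI keeps the darker background (the fix above).
    float p = (Y_max &gt;= uiNits) ? 1.0f / blendPow : blendPow;
    float w = std::pow(uiAlpha, p);   // bias the blend toward the UI
    return Y_max + (uiNits - Y_max) * w;
}
</pre>
<br />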
Below is an extreme example showing the artifact with Y_MAX set to 1 nit and the UI set to 80 nits:<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrUUqaD424APBQaZpRNp982vMcHphveRppF7kjmYdllqi8sS-BSXt0_xutOViIqHGBEBu8H441YC1YrYvaXZlDeEZ8bRFqr9OGr916cBfSfcZR9eVsQ9uZVIZUc59SDy14rumcY315SHIA/s1600/gradient_artifact.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrUUqaD424APBQaZpRNp982vMcHphveRppF7kjmYdllqi8sS-BSXt0_xutOViIqHGBEBu8H441YC1YrYvaXZlDeEZ8bRFqr9OGr916cBfSfcZR9eVsQ9uZVIZUc59SDy14rumcY315SHIA/s320/gradient_artifact.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Without the fix, the artifact is very noticeable at the upper right corner</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW5BX-iccdGYFF_-byHU1gTY1tn0I1FSJxBnZ3x9wSCoPmPyJUtuyoLT4GdN-jXGFLJmQLmIMS_hNQ_8ixiEbDfEYlqsN-CUvGMT93Kyf1FB-OJIG3prhKT5LtXPFWb9BD5_JEidvyJwu7/s1600/gradient_fix.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW5BX-iccdGYFF_-byHU1gTY1tn0I1FSJxBnZ3x9wSCoPmPyJUtuyoLT4GdN-jXGFLJmQLmIMS_hNQ_8ixiEbDfEYlqsN-CUvGMT93Kyf1FB-OJIG3prhKT5LtXPFWb9BD5_JEidvyJwu7/s320/gradient_fix.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">With the fix, the background is darkened properly</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
Looking back at the blue part of the ACES tone mapping function in the above code snippet, I was wondering whether the image would look different given that the Y_MIN and Y_MAX values are interpolated in the display color space (e.g. whether the image looks different when interpolating in scRGB vs Rec2020). To compare whether these 2 colors are the same, we need both colors in the same space, so let's convert the interpolated RGB value back to XYZ space:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiQfEh0UrtuoO5Jd4BELUqgml-9UzASczNKCj9AKtxxgsiGGee_e9G8mVLRa6P6aJLcJ7u8DLz84nbmzfcCS0XM3jgtXAGJNDfp5P-BMcCWcEEfV3IwyaNydF8dKLWvaJTgEMf6UrCXclK/s1600/lerp_compare.png" imageanchor="1"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiQfEh0UrtuoO5Jd4BELUqgml-9UzASczNKCj9AKtxxgsiGGee_e9G8mVLRa6P6aJLcJ7u8DLz84nbmzfcCS0XM3jgtXAGJNDfp5P-BMcCWcEEfV3IwyaNydF8dKLWvaJTgEMf6UrCXclK/s320/lerp_compare.png" width="320" /></a><br />
<br />
So interpolating in different display color spaces does make a difference as long as Y_MIN != 0. But the ACES tone mapper defaults to the STRETCH_BLACK option, which effectively sets Y_MIN = 0, so there should be no difference when interpolating values in different color spaces. Out of curiosity, I tried disabling the STRETCH_BLACK option to see whether the image would look different when switching between the scRGB and Rec2020 back buffers with Y_MIN &gt; 0, but the images still look the same even with a large Y_MIN... I am not sure why this happens; maybe the difference is too small to be noticeable... In the demo, I take this into account, treating Y_MIN as a value in the XYZ color space and interpolating like this:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQk7s788YuTzRWrOVjcT1pJBQA_BLCABuocK-ewqevRJG34b6BZ59L38AaOamkyluI4K-Q9ed3ZwOShpfqLo3bSGMzXPtr8ul8qlPeo5s7PQWgWl_v7reF9e5FOEGj8sdfhxwMDZCXq-Xh/s1600/lerp_Y_min_max.png" imageanchor="1"><img border="0" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQk7s788YuTzRWrOVjcT1pJBQA_BLCABuocK-ewqevRJG34b6BZ59L38AaOamkyluI4K-Q9ed3ZwOShpfqLo3bSGMzXPtr8ul8qlPeo5s7PQWgWl_v7reF9e5FOEGj8sdfhxwMDZCXq-Xh/s200/lerp_Y_min_max.png" width="200" /></a><br />
<br />
<span style="font-size: large;"><b>Debug View Mode</b></span><br />
To help debugging, several debug view modes are added. A "Luminance Range" view mode displays the luminance value of a pixel before tone mapping (i.e. the physical light luminance entering the virtual camera) and after tone mapping (i.e. the output light luminance the monitor should display); a small sketch follows the screenshots:<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjeQEB36TjobxxhcEbJ4L3Rs5MEyI1XDuqS3IHkh9sARZK1IJX1ypSxtpxl9UCES87l3XXZ2KhxOFgZtf4dVDAwWXGBB0UIzTdsmGEKeyp5BU8744hBNOmCwX8xfG-7nF4VJgnfnJPMB2k/s1600/lum_range_abs.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjeQEB36TjobxxhcEbJ4L3Rs5MEyI1XDuqS3IHkh9sARZK1IJX1ypSxtpxl9UCES87l3XXZ2KhxOFgZtf4dVDAwWXGBB0UIzTdsmGEKeyp5BU8744hBNOmCwX8xfG-7nF4VJgnfnJPMB2k/s320/lum_range_abs.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Showing pixel luminance before tone mapping.</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAM6SWQDPl0e_VCw5jLv2s1XRNzIeU87IwfLjYjBL5kFgZ7WG3yQNQhPcHeHNfCFY9ju3-bTPy-uK_slebAjQsaUmYTQFiCXY-Fk1HAF2XxY4dTESrcBs-ScEMXakl7Q-uyJAkn8HrOSAi/s1600/lum_range_toneMapped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAM6SWQDPl0e_VCw5jLv2s1XRNzIeU87IwfLjYjBL5kFgZ7WG3yQNQhPcHeHNfCFY9ju3-bTPy-uK_slebAjQsaUmYTQFiCXY-Fk1HAF2XxY4dTESrcBs-ScEMXakl7Q-uyJAkn8HrOSAi/s320/lum_range_toneMapped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Showing pixel luminance after tone mapping.</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
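The luminance value visualized by this view is just the relative luminance of the linear RGB value (in nits before tone mapping, display nits after); assuming Rec709 primaries, it is the standard weighted sum (other rendering spaces would use the Y row of their own RGB-to-XYZ matrix):
<pre>
// Relative luminance of a linear Rec709 RGB value (Rec709 luma weights).
float RelativeLuminance(float r, float g, float b)
{
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}
</pre>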
<br />
A "Gamut Clip Test" mode to high light pixel that fall outside Rec709 gamut, i.e. those color can only be viewed with a HDR display or wide color monitor (e.g. AdobeRGB / P3 monitor).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIpf489V0x1_6R3VsFsa5ubJ5lypHgGps4B7d_CDlSR2ZY_XmFaGRdVWvBbUXvEtEAJsO9-q1ZKK64lyulupf5gftCvMqy1HDa9sqdRwxuGb6GI6-LwjEWzLUfNQ0NthDwObRWuhl9PgN9/s1600/gamut_clip.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIpf489V0x1_6R3VsFsa5ubJ5lypHgGps4B7d_CDlSR2ZY_XmFaGRdVWvBbUXvEtEAJsO9-q1ZKK64lyulupf5gftCvMqy1HDa9sqdRwxuGb6GI6-LwjEWzLUfNQ0NthDwObRWuhl9PgN9/s320/gamut_clip.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Highlight clipped pixel with cyan color</td></tr>
</tbody></table>
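A minimal sketch of how such a test can be done, assuming the pre-tone-mapped color is in ACEScg: transform to linear Rec709 and flag any negative component (the matrix constants are the commonly quoted approximate values):
<pre>
struct float3 { float x, y, z; };

// ACEScg(AP1) -&gt; linear Rec709, Bradford-adapted D60 -&gt; D65
// (commonly quoted values, approximate).
static float3 AP1ToRec709(float3 c)
{
    return {  1.7050f * c.x - 0.6218f * c.y - 0.0833f * c.z,
             -0.1303f * c.x + 1.1408f * c.y - 0.0105f * c.z,
             -0.0240f * c.x - 0.1290f * c.y + 1.1530f * c.z };
}

// A pixel gets the cyan highlight when its Rec709 representation needs a
// negative component, i.e. the color lies outside the Rec709 triangle.
bool IsOutsideRec709(float3 ap1)
{
    float3 c = AP1ToRec709(ap1);
    return c.x &lt; 0.0f || c.y &lt; 0.0f || c.z &lt; 0.0f;
}
</pre>
<br />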
A "SDR HDR Split" mode to compare the SDR/HDR images. But this mode can only be a rough preview of how SDR will look like on HDR back buffer. It is because SDR images are usually displayed brighter than 80 nit, the SDR part need to be brighten up (say to 100-200 nit) to make it look similar to using a real SDR back buffer. Also, in this view mode, because the HDR back buffer expect a pixel value in nit (either linear or PQ encoded), I don't apply any gamma curve to the SDR portion of the image, which may also result in differences to a real SDR version too.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJOXoPJNlk7LIS05E5aETXtxggWbNU-JQAd4YeVwPAHcakQ42i-vn8cJ3ctkL_Gb2L1MCcOQ2ZacjBLtBF_TBlieOJEfdC4fU3GHQlmo1qDSUn-neItIkqB_Z_kLmdgAA8eHTjy98TQ0qS/s1600/SDR_HDR_2.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJOXoPJNlk7LIS05E5aETXtxggWbNU-JQAd4YeVwPAHcakQ42i-vn8cJ3ctkL_Gb2L1MCcOQ2ZacjBLtBF_TBlieOJEfdC4fU3GHQlmo1qDSUn-neItIkqB_Z_kLmdgAA8eHTjy98TQ0qS/s400/SDR_HDR_2.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A photo showing both SDR and HDR at the same time.<br />
Bloom is exaggerated to show the clipped color in SDR</td></tr>
</tbody></table>
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have talked about the color spaces used for outputting to an HDR display. No matter which color space format is used (e.g. scRGB / Rec2020), the displayed image should be identical if transformed correctly (except for some precision differences). Also, I have tried to play around with the HDR10 metadata, but most of the metadata does not change the image on my TV... I guess how the metadata is interpreted is device dependent. Lastly, the SDR UI is composited with the HDR image by first blending the color and then the brightness. A simple blending formula is enough for the demo. A more complicated algorithm could be explored in the future, say brightening the UI depending on the brightness of a blurred HDR background (e.g. maybe storing the background luminance in the alpha channel and blurring it together with the bloom pass?). A demo can be downloaded <a href="https://drive.google.com/file/d/1v6uxc9Yyz-HmorJTHouZkxOQLMObjoAI/view?usp=sharing"><b>here</b></a> to test on your HDR display (some of the options are hidden if not connected to an HDR display).<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://channel9.msdn.com/Events/Build/2017/P4061">https://channel9.msdn.com/Events/Build/2017/P4061</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://www.pyromuffin.com/2018/07/how-to-render-to-hdr-displays-on.html">https://www.pyromuffin.com/2018/07/how-to-render-to-hdr-displays-on.html</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://www.gdcvault.com/play/1024803/Advances-in-the-HDR-Ecosystem">https://www.gdcvault.com/play/1024803/Advances-in-the-HDR-Ecosystem</a></span><br />
<span style="font-size: x-small;">[4] <a href="https://www.gdcvault.com/play/1026443/Not-So-Little-Light-Bringing">https://www.gdcvault.com/play/1026443/Not-So-Little-Light-Bringing</a></span><br />
<span style="font-size: x-small;">[5] <a href="https://developer.nvidia.com/implementing-hdr-rise-tomb-raider">https://developer.nvidia.com/implementing-hdr-rise-tomb-raider</a></span><br />
<span style="font-size: x-small;">[6] <a href="https://www.asawicki.info/news_1703_programming_hdr_monitor_support_in_direct3d">https://www.asawicki.info/news_1703_programming_hdr_monitor_support_in_direct3d</a></span><br />
<span style="font-size: x-small;">[7] </span><span style="font-size: x-small;"><a href="https://onedrive.live.com/?authkey=%21AFU3moSbUzyUgaE&cid=A4B88088C01D9E9A&id=A4B88088C01D9E9A%21170&parId=A4B88088C01D9E9A%21106&o=OneUp">https://onedrive.live.com/?authkey=%21AFU3moSbUzyUgaE&cid=A4B88088C01D9E9A&id=A4B88088C01D9E9A%21170&parId=A4B88088C01D9E9A%21106&o=OneUp</a></span><br />
<span style="font-size: x-small;">[8] <a href="https://www.shadertoy.com/view/4tV3WW">https://www.shadertoy.com/view/4tV3WW</a></span><br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-29962553837133857522020-03-14T16:47:00.000+08:002020-04-18T17:32:38.356+08:00DXR Path Tracer<b><span style="font-size: large;">Introduction</span></b><br />
Can't believe it has been half a year since my last DXR AO post. It was a <a href="https://www.bbc.com/news/world-asia-china-49317695">hard time in Hong Kong</a> last year, but <a href="https://www.scmp.com/news/hong-kong/hong-kong-economy/article/3044121/tourist-arrivals-take-sharpest-plunge-november">thanks to the social unrest</a> and the <a href="https://www.hongkongfp.com/2020/02/02/no-choice-hong-kong-medical-workers-agree-strike-mainland-border-closures/">medical workers' strike</a>, the Wuhan Coronavirus has not spread widely in the local community (but there are still new cases every day...). Due to the virus, it is better to stay at home, so I continued to work on my path tracer. This new path tracer is unbiased, terminating rays with Russian Roulette. During path tracing, physical light units are used. Also, rendering can be done in a wide color space. Finally, the path traced result is tone mapped and output to an sRGB/wide color gamut depending on the display device. A demo can be downloaded <b><a href="https://drive.google.com/file/d/16UBQza1VTBVGncsbxokM7B1R1sKRUdUH/view?usp=sharing">here</a></b> (Please use the latest graphics driver, as I have encountered a device removed hang on my laptop RTX2060 with an old driver, but not on my desktop GTX1060... If the crash/hang still happens, please let me know. Thank you.).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgcvCr4ZiMgNzwdXbrp9jEILvTFcdRH_19g_oqTME0MJZneb5cGUFc-7MYvk0jI4fjqhriIzuCZNl0NfIU2NeKQnxIHSSyVGqkPpsTFO59cfzI845i5EQYH_6mLkYrRw1NvJ9w3Sa9V6l_/s1600/path_trace_scr_shot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgcvCr4ZiMgNzwdXbrp9jEILvTFcdRH_19g_oqTME0MJZneb5cGUFc-7MYvk0jI4fjqhriIzuCZNl0NfIU2NeKQnxIHSSyVGqkPpsTFO59cfzI845i5EQYH_6mLkYrRw1NvJ9w3Sa9V6l_/s640/path_trace_scr_shot.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Path Traced Sponza scene.</td></tr>
</tbody></table>
<br />
<br />
<b><span style="font-size: large;">Render Loop</span></b><br />
At the start of the demo (or after any camera movement/lighting changes), a structured buffer, <i>Ray Buffer</i>, is initialized with 1 ray per pixel using the camera transform.<br />
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghYrRZr-2F10FwM541K_9PXT1Hglep40yGAkrcmrAAK-6gfH_zKT2iX2reZRUSfK1GMF4VCFLLfeyYGLHvUU1dTLGZLNWzrjyUdCnYB_op_6UGohNj3GFocf276BUnN4Eyvzer1TVz5Y3_/s1600/rayData.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghYrRZr-2F10FwM541K_9PXT1Hglep40yGAkrcmrAAK-6gfH_zKT2iX2reZRUSfK1GMF4VCFLLfeyYGLHvUU1dTLGZLNWzrjyUdCnYB_op_6UGohNj3GFocf276BUnN4Eyvzer1TVz5Y3_/s400/rayData.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The struct stored in <i>Ray Buffer</i>, not tightly packed for easier understanding.</td></tr>
</tbody></table>
</div>
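As a hypothetical sketch of one element (the demo's real struct is the one pictured above, so the field names here are only guesses based on what the passes below need):
<pre>
#include &lt;cstdint&gt;

// Hypothetical Ray Buffer element layout, not the demo's actual struct.
struct RayData
{
    float    origin[3];      // ray origin in world space
    float    direction[3];   // unit ray direction
    float    throughput[3];  // accumulated path weight, used by Russian Roulette
    uint32_t pixelIndex;     // which Lighting Path Texture texel to accumulate into
};
</pre>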
<div>
Then a ray generation shader is dispatched to read the <i>Ray Buffer</i> and trace rays into the scene. Lighting is calculated, and new <i>Ray Buffer</i> elements are generated if the rays are not terminated, continuing the path tracing in the next frame. Below is a simplified flow of the rendering operations executed every frame (with render passes on the left and resources on the right):</div>
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_lCPQ8GmiGDm3TBK3DT_nDCTpI9GII95A5bg5mKLz6sDQ1chWeYL6aW1rqQrMr1VDM-EOlrvdK4D-nwsqgTlM4P8FLS6m9FawwGaEwcwUCIZuV_AGhBCHL5wM9ERITmpdVx36aFh6Skc/s1600/flowchart.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_lCPQ8GmiGDm3TBK3DT_nDCTpI9GII95A5bg5mKLz6sDQ1chWeYL6aW1rqQrMr1VDM-EOlrvdK4D-nwsqgTlM4P8FLS6m9FawwGaEwcwUCIZuV_AGhBCHL5wM9ERITmpdVx36aFh6Skc/s640/flowchart.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A simplified path tracing flow executed every frame</td></tr>
</tbody></table>
Let's start with the usage of the resources in the above flow chart:<br />
<ul>
<li><i>Ray Buffers</i> are structured buffers storing the RayData struct. A ray will be traced for each element and if the ray is not terminated by Russian Roulette, it will be stored back to the <i>Ray Buffer</i> for the next frame.</li>
<li><i>Lighting Path Texture</i> is used for accumulating the lighting result while a ray is traversing along the path from the camera. It can be thought of as an intermediate result because the path is not fully traversed within a single frame, but across several frames.</li>
<li><i>Progress Buffer</i> is an 8-byte buffer, with 4 bytes storing the current path depth and the other 4 bytes storing the total accumulated sample count.</li>
<li><i>Lighting Sample Texture</i> is used for accumulating the lighting results of all the terminated rays (i.e. accumulating every terminated ray's result from <i>Lighting Path Texture</i>).</li>
</ul>
As for the operations done in each render pass:<br />
<ol>
<li>The Ray Tracing Pass dispatches a ray generation shader, sampling the <i>Ray Buffer</i> according to DispatchRaysIndex() and then calling TraceRay() to calculate the lighting result inside the closest hit shader, randomly choosing diffuse or specular lighting (another shadow ray is traced towards the light source during the lighting calculation). The lighting result is added to the Lighting Path Texture and non-terminated rays are stored back into the <i>Ray Buffer</i> for the next frame.</li>
<li>Check whether all rays are terminated by using a D3D12 predicate on the counter buffer of the <i>Ray Buffer</i> (i.e. all rays terminated when counter == 0; see the predication sketch after this list). Then a different shader/operation is executed depending on whether the <i>Ray Buffer</i> is empty.</li>
<li>When there are still rays not terminated, increase the path depth in <i>Progress Buffer</i>.</li>
<li>When all rays are terminated, increase the sample count and set path depth to 0 in <i>Progress Buffer</i>.</li>
<li>Accumulate the current path lighting result in <i>Lighting Path Texture</i> to <i>Lighting Sample Texture</i>. Clear the <i>Lighting Path Texture</i> to 0 (it is cleared via a compute shader instead of a command list clear, as the predicate does not work on the clear API, <a href="https://docs.microsoft.com/en-us/windows/win32/direct3d12/predication">despite the spec saying it should</a>...)</li>
<li>Regenerate the rays in the <i>Ray Buffer</i> with 1 ray per pixel using the camera transform, for path tracing new lighting samples in the next few frames. </li>
<li>Display the current lighting result to the back buffer.</li>
</ol>
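The predication in step 2 can be sketched like this (the setup is assumed: the <i>Ray Buffer</i>'s UAV counter has been copied into an 8-byte buffer in the D3D12_RESOURCE_STATE_PREDICATION state):
<pre>
#include &lt;d3d12.h&gt;

// EQUAL_ZERO makes the enclosed commands execute only when the counter
// value is 0, i.e. when all rays have terminated.
void PredicateOnEmptyRayBuffer(ID3D12GraphicsCommandList* cmdList,
                               ID3D12Resource* counterReadback)
{
    cmdList-&gt;SetPredication(counterReadback, 0,
                            D3D12_PREDICATION_OP_EQUAL_ZERO);
    // ... record the "all rays terminated" passes (steps 4-6) here ...
    cmdList-&gt;SetPredication(nullptr, 0, D3D12_PREDICATION_OP_EQUAL_ZERO);

    // The opposite branch (step 3) can be recorded in the same way under
    // D3D12_PREDICATION_OP_NOT_EQUAL_ZERO.
}
</pre>
<br />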
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr6fohvk8xaVOmv81i0ZIxRhH8NGRql4wClncwDcAKE8f8D7cGfD0DVBFO5UKKMSWCW-kZoHO2BgKPWAF0z_36nB_lH1ohxsf4NayQbCC7PU_ptxL6F6stRKUIAMRsyx07TJ_ANEkYNpb6/s1600/path_trace_result_1.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr6fohvk8xaVOmv81i0ZIxRhH8NGRql4wClncwDcAKE8f8D7cGfD0DVBFO5UKKMSWCW-kZoHO2BgKPWAF0z_36nB_lH1ohxsf4NayQbCC7PU_ptxL6F6stRKUIAMRsyx07TJ_ANEkYNpb6/s320/path_trace_result_1.png" width="320" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivA9W4hOw05n0wX1POkwZF7ptMDyNHh8OQTJrWMUqZVSfqVpyJP-X_sQFJ6MBfAuP4rtxcOcr6KEvriKE2fD-2Jwuu0hNZ5ke1IGpJ628j3ViO7VNdAC3URtm5BFop8Yy3fUpkf4V2YBRs/s1600/path_trace_result_2.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivA9W4hOw05n0wX1POkwZF7ptMDyNHh8OQTJrWMUqZVSfqVpyJP-X_sQFJ6MBfAuP4rtxcOcr6KEvriKE2fD-2Jwuu0hNZ5ke1IGpJ628j3ViO7VNdAC3URtm5BFop8Yy3fUpkf4V2YBRs/s320/path_trace_result_2.png" width="320" /></a>
</td>
</tr>
<tr>
<th colspan="2"><span style="font-size: x-small;"><span style="font-weight: normal;">Path traced images</span></span></th>
</tr>
</tbody></table>
<br />
With the core operations described above, at most 2 rays per pixel can be launched to maintain an interactive frame rate on my GTX1060. On more powerful machines (i.e. RTX cards with hardware accelerated ray tracing), step 1 doesn't need to terminate at the first closest hit, but can bounce a few more times before storing back to the <i>Ray Buffer</i> (a "#Bounce/Frame" option is added to increase the number of bounces per frame for RTX cards).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUd7Uqv6vcABRChDuwSWy3F3W42nsAjZeeTrAr9mB0jvUoLOxJ0cZqtcncU6y0SV9K4NViruTZXyuGP82TTr7jNtd2M8IrWbE9yHdrYM2nz602bdXe7rHD6d8-PCyV7YtfAPlLjClJ6f8L/s1600/numBouncePerFrame.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUd7Uqv6vcABRChDuwSWy3F3W42nsAjZeeTrAr9mB0jvUoLOxJ0cZqtcncU6y0SV9K4NViruTZXyuGP82TTr7jNtd2M8IrWbE9yHdrYM2nz602bdXe7rHD6d8-PCyV7YtfAPlLjClJ6f8L/s1600/numBouncePerFrame.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Number of bounce per frame option to adjust performance</td></tr>
</tbody></table>
The current approach described above has 2 drawbacks. First, on the CPU we don't know how many rays are left in the <i>Ray Buffer</i>, so DispatchRays() is called with the maximum number of rays (i.e. viewport width * height) and terminates early within the ray generation shader. This can be fixed in the future with <a href="https://devblogs.microsoft.com/directx/dxr-1-1/#executeindirect">DXR Tier 1.1 using ExecuteIndirect()</a>. The second drawback is that the performance is not constant across frames, because the number of rays to be traced decreases every frame and then resets, so the frame rate fluctuates.<br />
<br />
<b><span style="font-size: large;">ACES tone mapping</span></b><br />
After calculating the HDR lighting value, we need to perform a tone map pass to map the lighting value to a displayable range. ACES tone mapping is chosen due to its popularity in recent years. ACES has a few tone mapping curves (they call them RRT + ODT) for <a href="https://github.com/ampas/aces-dev/tree/master/transforms/ctl/odt">different displays with different color gamuts and viewing conditions</a>. Some common display types are <a href="https://github.com/ampas/aces-dev/blob/master/transforms/ctl/odt/sRGB/ODT.Academy.sRGB_100nits_dim.ctl">sRGB_100nit</a> and <a href="https://github.com/ampas/aces-dev/blob/master/transforms/ctl/odt/rec709/ODT.Academy.Rec709_100nits_dim.ctl">Rec709_100nit</a>. The input of the RRT+ODT function expects RGB values in the ACES2065-1 (AP0) gamut with a white point around (but not exactly) D60. So we need to convert our lighting value (L_sRGB) to the AP0 gamut by multiplying a few transformation matrices:</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4Dr60y1M512Bpe0-nG3Ppk_gogN3D1PQFt1vqK4Fu8IkiFbMnxBmknx0Mmi5OWFapcRti6CGEKiSv9rGgpi5uoeHQI31Qqwqx9oZWw9I-t0SBgaUjpV32Yg80VKCkwRx-MdjzUe8ODmNS/s1600/conv_2_AP0.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="53" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4Dr60y1M512Bpe0-nG3Ppk_gogN3D1PQFt1vqK4Fu8IkiFbMnxBmknx0Mmi5OWFapcRti6CGEKiSv9rGgpi5uoeHQI31Qqwqx9oZWw9I-t0SBgaUjpV32Yg80VKCkwRx-MdjzUe8ODmNS/s320/conv_2_AP0.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">operation to convert RGB value from sRGB to AP0</td></tr>
</tbody></table>
The above steps mean first transforming the sRGB values to XYZ color space with a D65 white point (gamut transformation matrices can be calculated using the formula from <a href="http://www.brucelindbloom.com/index.html?Eqn_RGB_XYZ_Matrix.html">here</a>), then applying a Chromatic Adaptation Transform (CAT) due to the different white points between sRGB and AP0 (the matrix can be calculated using the formula from <a href="http://www.brucelindbloom.com/index.html?Eqn_ChromAdapt.html">here</a>). Finally, the XYZ value can be transformed to the AP0 gamut. All these matrices can be combined into 1 matrix-vector multiplication as an optimization, as sketched below. Then this value can be fed into the ACES RRT+ODT to compute the back buffer value for display.</div>
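Collapsed into a single matrix-vector multiply, the chain looks like this (the constants are the commonly quoted linear sRGB to AP0 values with the Bradford CAT already folded in; treat them as approximate):
<pre>
struct float3 { float x, y, z; };

// Linear sRGB (D65) -&gt; ACES2065-1/AP0 (~D60): sRGB-&gt;XYZ, CAT and
// XYZ-&gt;AP0 pre-multiplied into one matrix.
float3 SRGBToAP0(float3 c)
{
    return { 0.4397f * c.x + 0.3830f * c.y + 0.1773f * c.z,
             0.0898f * c.x + 0.8134f * c.y + 0.0968f * c.z,
             0.0175f * c.x + 0.1115f * c.y + 0.8707f * c.z };
}
</pre>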
<div>
<br />
So we only need to select the appropriate ODT for the target display device. But unfortunately, not every common display gamut is provided; for example, my recently bought RTX laptop comes with a 100% AdobeRGB color gamut monitor, and ACES does not provide a suitable ODT to display the image in the AdobeRGB color space. Using the common sRGB ODT, the image will look too saturated. So I added a "Remap display color gamut" option in the demo:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8xGcIOdat-2r1zAFhX5qIbhYhGr-QnIqxpj434jkN7oKbcQLZQf8uCR9XOsQJiQHz7bIeBkgoFp8iZGJfiy49sVJLlsDyOm2L9GgPw2_yGGlqoe_XhjRhe90oUGznj1Q5FaeflKWv4KMh/s1600/remap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8xGcIOdat-2r1zAFhX5qIbhYhGr-QnIqxpj434jkN7oKbcQLZQf8uCR9XOsQJiQHz7bIeBkgoFp8iZGJfiy49sVJLlsDyOm2L9GgPw2_yGGlqoe_XhjRhe90oUGznj1Q5FaeflKWv4KMh/s320/remap.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Remapping option to display the path traced result according to display color primaies</td></tr>
</tbody></table>
The Remap display color gamut option performs the following steps on the output of the RRT+ODT (a sketch follows the list):<br />
<ol>
<li>Apply EOTF function to the ODT output to get linear lighting value.</li>
<li>Transform the resulting RGB value from step 1 to the target display color gamut RGB value (e.g. the AdobeRGB gamut on my laptop display), with a Chromatic Adaptation Transform applied.</li>
<li>Apply OETF function to the output of step 2 for display.</li>
</ol>
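A sketch of these 3 steps for an AdobeRGB panel (sRGB and AdobeRGB both use a D65 white point, so no CAT is actually needed in this particular pair, and the pow(2.2) calls only approximate the real sRGB EOTF and AdobeRGB OETF):
<pre>
#include &lt;cmath&gt;

struct float3 { float x, y, z; };

float3 RemapDisplayGamut(float3 odtOutput)   // sRGB ODT output in [0, 1]
{
    // 1. EOTF: decode to linear light (2.2 approximates the sRGB curve)
    float3 lin = { std::pow(odtOutput.x, 2.2f),
                   std::pow(odtOutput.y, 2.2f),
                   std::pow(odtOutput.z, 2.2f) };
    // 2. rotate into the display primaries (sRGB -&gt; AdobeRGB, both D65)
    float3 disp = { 0.7152f * lin.x + 0.2848f * lin.y,
                    lin.y,
                    0.0412f * lin.y + 0.9588f * lin.z };
    // 3. OETF: re-encode for scan-out (AdobeRGB gamma is close to 2.2)
    return { std::pow(disp.x, 1.0f / 2.2f),
             std::pow(disp.y, 1.0f / 2.2f),
             std::pow(disp.z, 1.0f / 2.2f) };
}
</pre>
<br />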
By doing the above remapping, I can get similar results between my AdobeRGB laptop monitor and my sRGB desktop monitor. One drawback is that, although we can query the color primaries of the display, they are not always accurate. For example, on my laptop I can switch to a regular sRGB view mode, but <a href="https://docs.microsoft.com/en-us/windows/win32/api/dxgi1_6/nf-dxgi1_6-idxgioutput6-getdesc1">IDXGIOutput6::GetDesc1()</a> still returns the AdobeRGB color primaries. I have also tried some other monitors; they have color primaries greater than sRGB, but not exactly the AdobeRGB or P3 primaries, and they also have different view modes such as AdobeRGB or sRGB. So I just leave the gamut remapping function optional in the demo and let the user choose their remap color primaries.<br />
<br />
Also, digging deeper into the ACES ODT source code, the 3 ODTs used in the demo share much common code and only differ in the color space transform function / OETF at the end of the ODT. In the future, I may refactor the RRT+ODT code, remove the remap display gamut function and directly transform the ACES ODT output in XYZ space to the display gamut queried by <a href="https://docs.microsoft.com/en-us/windows/win32/api/dxgi1_6/nf-dxgi1_6-idxgioutput6-getdesc1">IDXGIOutput6::GetDesc1()</a> (or a user selected gamut).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoWkEZaiw29mssi20YhrV3U6DGZd_fHkxfhn1HUTBMEW8-h50mqpe0Ae3ra1t0DP2Y0qREa9xnPD6u5EremzFeV-IrUnDGAoqoPRvicju5_jjiKSJiYzC6CKlgrDjbRv1_CCWqtAEvmbMe/s1600/ACES_ODT.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoWkEZaiw29mssi20YhrV3U6DGZd_fHkxfhn1HUTBMEW8-h50mqpe0Ae3ra1t0DP2Y0qREa9xnPD6u5EremzFeV-IrUnDGAoqoPRvicju5_jjiKSJiYzC6CKlgrDjbRv1_CCWqtAEvmbMe/s400/ACES_ODT.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An ODT from ACES, the blue part is the same for all the 3 ODT used in the demo.<br />
The orange part is different depends on display, which can be replaced by display <br />
primaries returned from IDXGIOutput6::GetDesc1(), so the "Remap display color <br />
gamut" in the demo can be removed in the future. </td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">WCG rendering</span></b><br />
Equipped with the knowledge of transforming between color spaces, I decided to try rendering in a Wide Color Gamut instead. Games like <a href="http://www.polyphony.co.jp/publications/sa2018/">GT Sport</a> already render in wide color (Rec2020). Performing the lighting calculation in a wide color gamut can result in more accurate lighting than rendering in the sRGB color space (even when displaying on an sRGB monitor).<br />
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIZYE-WFFnuChf2h527lbAsBRlRd70bkTpQvk5LWKUDZlwq0H89YqA-d2h9k7HsMwJDzzdSm3IWMa7dTBXRKXmGuPHq_TrHTdmaJykFlmnA6gc0WmVPG60H41qlxafCEHBP6olXKeazr1_/s1600/render_sRGB.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIZYE-WFFnuChf2h527lbAsBRlRd70bkTpQvk5LWKUDZlwq0H89YqA-d2h9k7HsMwJDzzdSm3IWMa7dTBXRKXmGuPHq_TrHTdmaJykFlmnA6gc0WmVPG60H41qlxafCEHBP6olXKeazr1_/s200/render_sRGB.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSrAkCoudVzQa0Q3is3YWoIllKvvCHZadkCDWlfPQNK87f674SzDF2JX_dplqwx9f7rcNqDAQ1On18bWUZlTvGSZnYpEBHAhg4jxmqLb-XD4DJ4XzOwjori5qFg4FShXqNaltk-PYZJVlH/s1600/render_ACEScg.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSrAkCoudVzQa0Q3is3YWoIllKvvCHZadkCDWlfPQNK87f674SzDF2JX_dplqwx9f7rcNqDAQ1On18bWUZlTvGSZnYpEBHAhg4jxmqLb-XD4DJ4XzOwjori5qFg4FShXqNaltk-PYZJVlH/s200/render_ACEScg.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh_y1ru9NfAVTXsOeg3CH4i1JxWujCOmPGmVINRlhEnwrfMI0Qf8qVBcVY_UZeTygPVRbXI7rpyEUsgXk1_NXCBx3JZyN-yFVhWksZ7zpmdqB5q1IybxTVdRnJn5k4X7oqcnc-37neRIdA/s1600/render_rec2020.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh_y1ru9NfAVTXsOeg3CH4i1JxWujCOmPGmVINRlhEnwrfMI0Qf8qVBcVY_UZeTygPVRbXI7rpyEUsgXk1_NXCBx3JZyN-yFVhWksZ7zpmdqB5q1IybxTVdRnJn5k4X7oqcnc-37neRIdA/s200/render_rec2020.png" width="200" /></a>
</td>
</tr>
<tr>
</tr>
<tr>
<th colspan="3"><span style="font-size: x-small;"><span style="font-weight: normal;">Path Traced result rendered in different color space. Left:sRGB, Center:ACEScg, Right:Rec2020</span></span></th>
</tr>
</tbody></table>
<br />
In the demo, it can path trace in the sRGB, ACEScg or Rec2020 color space. Inside the closest hit shader, the albedo texture is read and transformed from sRGB into the chosen rendering color space (a conversion sketch follows this paragraph). Also, the light color is converted to the chosen rendering color space and then multiplied with the intensity. Finally, inside the tone mapping pass, the result of the lighting calculation is transformed to the AP0 color space and fed into the ACES RRT+ODT for display. You may notice some differences between rendering in sRGB and in a wide color gamut (e.g. ACEScg and Rec2020). If you have a wide color gamut monitor (e.g. AdobeRGB or DCI-P3), you can try the Rec2020 ODT with the "Remap display color gamut" option on (described in the last section). This produces less color clamping and displays more saturated colors. But under normal lighting conditions, the difference is not that big; we need to set up specific lighting, such as a local sphere light with a saturated color, before those wide colors show up. I guess this is because both the albedo texture and the light color are in sRGB space; content may need to be adjusted in order to take advantage of a wide color display.<br />
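As an example of that conversion, here is the sRGB to Rec2020 case (both share the D65 white point, so it is a pure gamut rotation; the constants are the standard BT.2087 matrix). The ACEScg case is analogous, but its matrix also folds in a D65-to-D60 CAT:
<pre>
struct float3 { float x, y, z; };

float3 SRGBToRec2020(float3 c)   // linear sRGB in, linear Rec2020 out
{
    return { 0.6274f * c.x + 0.3293f * c.y + 0.0433f * c.z,
             0.0691f * c.x + 0.9195f * c.y + 0.0114f * c.z,
             0.0164f * c.x + 0.0880f * c.y + 0.8956f * c.z };
}
</pre>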
<span style="font-size: x-small;">
</span><span style="font-size: x-small;">
</span>
<br />
<table>
<tbody>
<tr>
<td><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/dxr_path_tracer/images/WCG_render_sRGB.png" imageanchor="1"><img border="0" height="185" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/dxr_path_tracer/images/WCG_render_sRGB.png" width="320" /></a>
</td>
<td><a href="https://raw.githubusercontent.com/simon-yeunglm/blog/master/dxr_path_tracer/images/WCG_render_adobeRGB.png" imageanchor="1"><img border="0" height="185" src="https://raw.githubusercontent.com/simon-yeunglm/blog/master/dxr_path_tracer/images/WCG_render_adobeRGB.png" width="320" /></a>
</td>
</tr>
<tr>
<th colspan="2"><span style="font-size: x-small;"><span style="font-weight: normal;">Wide Color path traced image, saved with different profile. Left:saved with sRGB profile Right: saved with AdobeRGB profile. </span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">The right image shows more saturated color when viewed on a color managed browser with a wide color display (e.g. iPhone monitor), </span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">otherwise 2 images may look the same.</span></span></th></tr>
</tbody></table>
<br />
Also, please note that this kind of wide color support is different from the Windows 10 HDR/WCG settings. On my laptop, Windows reports No for both HDR and WCG, but it does have an AdobeRGB monitor capable of displaying wide color; we just need to correctly transform the images using the monitor color gamut.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc98np1vTdaqQELfNYNPFFziriI1LTx2CDLyQNWHSRTkdQxb74jc73TLSUQDiL-FzS6U7xbd28Ia5VinaY3KLmC3XqYTLdokf26VxowToU4j5RUeNnG0CxhOC6yxJwTV6bR6CRigwgGr3p/s1600/win_settings.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="83" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc98np1vTdaqQELfNYNPFFziriI1LTx2CDLyQNWHSRTkdQxb74jc73TLSUQDiL-FzS6U7xbd28Ia5VinaY3KLmC3XqYTLdokf26VxowToU4j5RUeNnG0CxhOC6yxJwTV6bR6CRigwgGr3p/s400/win_settings.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My laptop has an AdobeRGB monitor, but Windows 10 Display capabilities report No for WCG.</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">ACES ODT blue light artifact</span></b><br />
So far everything looks good when rendering in a wide color space. Colors get desaturated when they are over-exposed. But there is still an issue when using a strong blue light...<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOg8OPYfTddQyxKZSqF7IeT_oeXOQIoihiZIbeI534oCCeo5eFMWlj9Oq43Zh6zws1KySMzeNacXbKPwA5i16cvQPV_zSuLc7kx_J3SlzRSgs9gAyJaWjb7A40C2LJ1jajXcMcZdNNmM5Z/s1600/blue_nofix.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOg8OPYfTddQyxKZSqF7IeT_oeXOQIoihiZIbeI534oCCeo5eFMWlj9Oq43Zh6zws1KySMzeNacXbKPwA5i16cvQPV_zSuLc7kx_J3SlzRSgs9gAyJaWjb7A40C2LJ1jajXcMcZdNNmM5Z/s320/blue_nofix.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Using a strong sRGB blue light will introduce hue shift...</td></tr>
</tbody></table>
<br />
It is because pure blue (0, 0, 255) in sRGB space is not saturated enough when transformed to a wide color gamut (e.g. ACEScg/Rec2020). Looking inside the ACES dev repo, it has a <a href="https://github.com/ampas/aces-dev/blob/master/transforms/ctl/lmt/LMT.Academy.BlueLightArtifactFix.ctl">blue light artifact fix LMT</a> for this issue. It works by de-saturating the blue color a bit to lessen the hue shift (an illustrative sketch follows). So in the demo, I provided a "Blue Correction" parameter to adjust the blue de-saturation strength (as a side note, UE4 also uses the ACES tone mapper and comes with a <a href="https://docs.unrealengine.com/en-US/API/Runtime/Engine/Engine/FPostProcessSettings/BlueCorrection/index.html">blue correction parameter</a> in the post process settings).<br />
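The real fix is the 3x3 matrix inside the linked LMT; purely as an illustrative stand-in (not the ACES matrix), the de-saturation can be thought of as bleeding a bit of the blue channel into red and green, scaled by the correction strength:
<pre>
struct float3 { float x, y, z; };

// Illustrative only: lessen the blue hue shift by adding a fraction of B
// into R and G (more into G, roughly the direction the ACES LMT takes);
// strength = 0 is a no-op, 1 is full correction.
float3 BlueCorrect(float3 c, float strength)
{
    return { c.x + strength * 0.08f * c.z,
             c.y + strength * 0.16f * c.z,
             c.z };
}
</pre>
<br />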
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS52j4tTUxlKbM_B2EM2wUJD6eB8wxaIZ7YA-AYDMuSz-xdVnlRJFntB3dpBJSNAaBTLs1Ex6L6aaWbnPWx4oLlIhfV6SkTmSsGBESAzHoiNXKSTY53hjXFpO0hCQfgjSlVB-2fmyXCfYn/s1600/blue_fix.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS52j4tTUxlKbM_B2EM2wUJD6eB8wxaIZ7YA-AYDMuSz-xdVnlRJFntB3dpBJSNAaBTLs1Ex6L6aaWbnPWx4oLlIhfV6SkTmSsGBESAzHoiNXKSTY53hjXFpO0hCQfgjSlVB-2fmyXCfYn/s320/blue_fix.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">desaturating blue color to fix the hue shift</td></tr>
</tbody></table>
<br />
But I do like the saturated blue color, and seeing the blue light artifact fix LMT de-saturate it makes me sad. Below is a comparison with/without the blue light fix LMT:<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi53NUAnOWkLqtzhqyRk4CeNpgsyvtepTuPK0-d3Ofxkiku7m1SSMhqYONVagLRde_h2Z58DzLHon_Pkg672P9pcsBYAy6DLmrugz5ymsArL62EIKbzG-V5e54iwWccDua6LvMlR6_bmIcs/s1600/blue_normal_nofix.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi53NUAnOWkLqtzhqyRk4CeNpgsyvtepTuPK0-d3Ofxkiku7m1SSMhqYONVagLRde_h2Z58DzLHon_Pkg672P9pcsBYAy6DLmrugz5ymsArL62EIKbzG-V5e54iwWccDua6LvMlR6_bmIcs/s320/blue_normal_nofix.png" width="320" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZOq5zxiKC1PVR0pPVnAfUIiZttxpSa6ra_wA_ydF2MDO_JjoZN90YSdIuD_FWt531IGIRuKI_LE9Nh3WHOsvq8WenfXOxTYoMSoco5yNFU99G9qtiKSTPnGnaWKQ58URgnQGKDwlXfIOX/s1600/blue_normal_withFix.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZOq5zxiKC1PVR0pPVnAfUIiZttxpSa6ra_wA_ydF2MDO_JjoZN90YSdIuD_FWt531IGIRuKI_LE9Nh3WHOsvq8WenfXOxTYoMSoco5yNFU99G9qtiKSTPnGnaWKQ58URgnQGKDwlXfIOX/s320/blue_normal_withFix.png" width="320" /></a>
</td>
</tr>
<tr>
<th colspan="3"><span style="font-size: x-small;"><span style="font-weight: normal;">Left: without blue light fix LMT, Right: with blue light fix LMT
</span></span></th>
</tr>
</tbody></table>
<br />
So maybe we can work around the problem the other way: instead of making the blue color less saturated, we can make the light color more saturated. So I added a light "Color Picker Space" combo box to specify the color space of the picked RGB light value, so that a more saturated blue light color can be chosen. By picking an extremely saturated blue (0, 0, 255) RGB value in ACEScg color space, we can get rid of the purple color:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI_2wWQVHgWL-l_j1Wj3fmYKHfegJeeB0j3W-xMS7Oa3BhXQoTQS_lFLBwy9YwFtHHtKu-9krQ2e4LiQQ77YCDqugLA7-XM23JvJZA4_-cwgBxVD3LcsG24Z4zQn9LZD-JLRWfwu2mM98n/s1600/blue_ACEScg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI_2wWQVHgWL-l_j1Wj3fmYKHfegJeeB0j3W-xMS7Oa3BhXQoTQS_lFLBwy9YwFtHHtKu-9krQ2e4LiQQ77YCDqugLA7-XM23JvJZA4_-cwgBxVD3LcsG24Z4zQn9LZD-JLRWfwu2mM98n/s320/blue_ACEScg.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Using a saturated blue light in ACEScg space, without the blue light fix LMT</td></tr>
</tbody></table>
<br />
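Under the hood, the "Color Picker Space" option is just a change of interpretation followed by a matrix multiply. Below is a minimal HLSL sketch, assuming ACEScg is the rendering space and a hypothetical PICKER_SPACE_SRGB enum value; the matrix is the usual sRGB(D65) to ACEScg(AP1, D60) transform derived with Bruce Lindbloom's method, so treat the values as approximate:<br />
<blockquote class="tr_bq">
// sRGB -> ACEScg (Bradford adapted D65 -> D60)<br />
static const float3x3 sRGB_2_ACEScg =<br />
{<br />
0.613097, 0.339523, 0.047379,<br />
0.070194, 0.916354, 0.013452,<br />
0.020616, 0.109570, 0.869815<br />
};<br />
// a picked (0, 0, 1) stays (0, 0, 1) in ACEScg mode: far more saturated than sRGB blue<br />
float3 lightColor = (pickerSpace == PICKER_SPACE_SRGB) ? mul(sRGB_2_ACEScg, pickedRGB) : pickedRGB;<br />
</blockquote>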
<b><span style="font-size: large;">Bloom</span></b><br />
Lastly, a bloom pass is added before tone mapping. Bloom pixels are extracted using a threshold: any lighting value that exceeds the maximum luminance of the current exposure value. The <a href="https://seblagarde.files.wordpress.com/2015/07/course_notes_moving_frostbite_to_pbr_v32.pdf">maximum luminance</a> is calculated with:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg892JWe-Ae3RWrQ70S1aKB6GYCN0AyZNJJIar23PEokwghs91FTkDZNEI3TfcF9NV6t0BFWpa44H0AnqQi8UjYQLggdTnMVEDXO7xNymVyrwL-PDjMjIWt3BWxCNsq_up_SrRnj9MEvmBG/s1600/max_lum.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="44" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg892JWe-Ae3RWrQ70S1aKB6GYCN0AyZNJJIar23PEokwghs91FTkDZNEI3TfcF9NV6t0BFWpa44H0AnqQi8UjYQLggdTnMVEDXO7xNymVyrwL-PDjMjIWt3BWxCNsq_up_SrRnj9MEvmBG/s320/max_lum.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">max luminance calculated using EV100</td></tr>
</tbody></table>
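With the Frostbite constants (q = 0.65, ISO S = 100), the formula above boils down to a single line of HLSL (a sketch; the variable names are mine):<br />
<blockquote class="tr_bq">
// maxLum = 78 / (q * S) * 2^EV100 = 1.2 * 2^EV100<br />
float maxLuminance = 1.2 * exp2(ev100);<br />
</blockquote>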
But simply subtracting the threshold from the RGB lighting value will introduce a hue shift in the bloom color. So the RGB lighting value is transformed to HSV space, the threshold is subtracted from V, and the result is transformed back to RGB space (we keep all the RGB values in the rendering space, without converting the lighting value from ACEScg/Rec2020 to sRGB during the HSV conversion, as there is not much difference between the bloom results). Given an image with HDR values:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGlwenWhvwzqyKfhfBxYduhQkyF5renAXo2QL82Nn2rvgDGA9iWlQY1d_AaMft4i8qBUjVgGmzZ5xXMbDEl6AoVSW-cpB_UPjZQbJ2oejWPSH8MIxqJ8dOmb8u9oKG0AlTCeSDWLN0Tfur/s1600/bloom.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGlwenWhvwzqyKfhfBxYduhQkyF5renAXo2QL82Nn2rvgDGA9iWlQY1d_AaMft4i8qBUjVgGmzZ5xXMbDEl6AoVSW-cpB_UPjZQbJ2oejWPSH8MIxqJ8dOmb8u9oKG0AlTCeSDWLN0Tfur/s320/bloom.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Input image for the bloom pass</td></tr>
</tbody></table>
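Concretely, the extraction step might look like the sketch below, assuming RGBtoHSV()/HSVtoRGB() helper functions (hypothetical names; as noted above, the conversion stays in the rendering space):<br />
<blockquote class="tr_bq">
float3 hsv = RGBtoHSV(lighting.rgb);<br />
hsv.z = max(hsv.z - maxLuminance, 0.0); // subtract the threshold from V only<br />
float3 bloom = HSVtoRGB(hsv); // hue and saturation are preserved<br />
</blockquote>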
The difference between thresholding in HSV space and in RGB space:<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG1Gw9W0ddO22sDsny6Oman1ceOMxIbfWFOJ1nxW6G1WL3-8MQyJADL7PqXKoeTBhqFTeslyLWDW1fQsjKHN535_Ph27q0oCo_2WqON3tP_F3WHEeC4gL4c6oWZ4eRRE12E1xnyAI2qUfb/s1600/bloom_HSV.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG1Gw9W0ddO22sDsny6Oman1ceOMxIbfWFOJ1nxW6G1WL3-8MQyJADL7PqXKoeTBhqFTeslyLWDW1fQsjKHN535_Ph27q0oCo_2WqON3tP_F3WHEeC4gL4c6oWZ4eRRE12E1xnyAI2qUfb/s320/bloom_HSV.png" width="320" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEIM0bCpqVAMgk0l_ksN4gWXJehVze4c41WsUqAn1foTVr9futg8OMV5ULvCSLgOC9NO2dalEl6H82wtnn6HaoVNZ88hhyphenhyphenC3IDyoanwkks9DPoJ3Q42gfhMcN_VCX9M0NEEhVIvuosPWu2/s1600/bloom_RGB.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEIM0bCpqVAMgk0l_ksN4gWXJehVze4c41WsUqAn1foTVr9futg8OMV5ULvCSLgOC9NO2dalEl6H82wtnn6HaoVNZ88hhyphenhyphenC3IDyoanwkks9DPoJ3Q42gfhMcN_VCX9M0NEEhVIvuosPWu2/s320/bloom_RGB.png" width="320" /></a>
</td>
</tr>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8oLJutht0oprlCvO5_n5RIgXeLJ1CjTLXUJ1CFR0keobg0E0Ic7rnRhoKMs48FJlvulleHQtTtglxWXaQ-19a0B-ikSM8Rx0JqOUzQ7tdB2Cck8qQ0j4_9T8lQGqeXUL9jSqGyidq7vNR/s1600/bloom_HSV_debug.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8oLJutht0oprlCvO5_n5RIgXeLJ1CjTLXUJ1CFR0keobg0E0Ic7rnRhoKMs48FJlvulleHQtTtglxWXaQ-19a0B-ikSM8Rx0JqOUzQ7tdB2Cck8qQ0j4_9T8lQGqeXUL9jSqGyidq7vNR/s320/bloom_HSV_debug.png" width="320" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKuyBhpdRFwJkdc8v5ohpzyK9nQ2e3-cvTTZ2ltrABVzHZStEnPHJ7_3wZ5jDIpuFvwCqBl70-pd8gpHLRuWap2Vt-l4VUOQSuQDAwnYerhi4QAF3aisbYNAqkPQdNwiF0U9PlUPzeL135/s1600/bloom_RGB_debug.png" imageanchor="1"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKuyBhpdRFwJkdc8v5ohpzyK9nQ2e3-cvTTZ2ltrABVzHZStEnPHJ7_3wZ5jDIpuFvwCqBl70-pd8gpHLRuWap2Vt-l4VUOQSuQDAwnYerhi4QAF3aisbYNAqkPQdNwiF0U9PlUPzeL135/s320/bloom_RGB_debug.png" width="320" /></a>
</td>
</tr>
<tr>
<th colspan="2"><span style="font-size: x-small;"><span style="font-weight: normal;">Left column: bloom in HSV space. Right column: bloom in RGB space.</span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">Upper row: Lighted scene combined with bloom.</span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">Lower row: Debug images showing only the bloom component.</span></span>
</th>
</tr>
</tbody></table>
<br />
The bloom calculated in HSV space introduces less saturated colors. The effect is exaggerated when the image is over-exposed:<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjblPQZU8BrmFdAMGAsN1bQEIco6ngIL_V4ixDQxonOMWvxaq83DU25OupSav5x0yRMo-nlGOjfto6gXL91h6FjUUuXc0NMJyG9Fen8jyVnpLj53K6wrxSpSskGmhcM6p1OVZkgbsvZ9Fym/s1600/bloom_overExp.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjblPQZU8BrmFdAMGAsN1bQEIco6ngIL_V4ixDQxonOMWvxaq83DU25OupSav5x0yRMo-nlGOjfto6gXL91h6FjUUuXc0NMJyG9Fen8jyVnpLj53K6wrxSpSskGmhcM6p1OVZkgbsvZ9Fym/s200/bloom_overExp.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ4A5zT0-PXleAWZ8-DFfFrAqTtEsm-w9sOl_HPhSOX-B90bD093G4ZQ7TXslwrH8AS_xZ82PgwZPWUQbD0ue3nw5jEKTgAD0oFagy-nJPd4P5aQxj2Hv3vA4YehvukS_53mvRPPXFJjzZ/s1600/bloom_overExp_HSV.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ4A5zT0-PXleAWZ8-DFfFrAqTtEsm-w9sOl_HPhSOX-B90bD093G4ZQ7TXslwrH8AS_xZ82PgwZPWUQbD0ue3nw5jEKTgAD0oFagy-nJPd4P5aQxj2Hv3vA4YehvukS_53mvRPPXFJjzZ/s200/bloom_overExp_HSV.png" width="200" /></a>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiybAokfBrNROB_LGDZAqVUl1M45y0mgMHyy6u2Oir-75YMy49fwz-wsYtSAqcBH0oHHR-XGABdUOHyFEDdpto1fkVSqyxpsz_qhoITyywEDI6RWPv2T0hEAJL9Z9md-D4A2IuJ3tn7Y6x0/s1600/bloom_overExp_RGB.png" imageanchor="1"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiybAokfBrNROB_LGDZAqVUl1M45y0mgMHyy6u2Oir-75YMy49fwz-wsYtSAqcBH0oHHR-XGABdUOHyFEDdpto1fkVSqyxpsz_qhoITyywEDI6RWPv2T0hEAJL9Z9md-D4A2IuJ3tn7Y6x0/s200/bloom_overExp_RGB.png" width="200" /></a>
</td>
</tr>
<tr>
<th colspan="3"><span style="font-size: x-small;"><span style="font-weight: normal;">Left:Bloom input image. </span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">Center:Bloom in HSV space. </span></span><br />
<span style="font-size: x-small;"><span style="font-weight: normal;">Right: Bloom in RGB space.</span></span></th>
</tr>
</tbody></table>
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, the core algorithm of my DXR path tracer is described, together with some color space conversions. There is much more to be done in the future, like supporting dynamic geometry during ray tracing, adding a denoiser for the path-traced output, implementing hybrid rasterization/ray tracing rendering, and spectral rendering to compute a ground-truth reference. Also, this is my first time writing code for color space management. Currently, in the demo, the 3D lighting can be displayed correctly using the monitor gamut, but the UI is not color-managed properly. 4K and HDR also need to be supported.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] </span><a href="https://seblagarde.files.wordpress.com/2015/07/course_notes_moving_frostbite_to_pbr_v32.pdf">https://seblagarde.files.wordpress.com/2015/07/course_notes_moving_frostbite_to_pbr_v32.pdf</a><br />
<span style="font-size: x-small;">[2] </span><a href="https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#addressing-calculations-within-shader-tables">https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#addressing-calculations-within-shader-tables</a><br />
<span style="font-size: x-small;">[2] </span><a href="https://github.com/ampas/aces-dev">https://github.com/ampas/aces-dev</a><br />
<span style="font-size: x-small;">[3] </span><a href="http://www.brucelindbloom.com/index.html?Eqn_RGB_XYZ_Matrix.html">http://www.brucelindbloom.com/index.html?Eqn_RGB_XYZ_Matrix.html</a><br />
<span style="font-size: x-small;">[4] </span><a href="http://www.brucelindbloom.com/index.html?Eqn_ChromAdapt.html">http://www.brucelindbloom.com/index.html?Eqn_ChromAdapt.html</a><br />
<span style="font-size: x-small;">[5] </span><a href="http://www.polyphony.co.jp/publications/sa2018/">http://www.polyphony.co.jp/publications/sa2018/</a><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><br /></span>
<br />
<br />
<br />
<br />
<br />
<br />
<br /></div>
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-17506575114406170812020-01-20T01:56:00.002+08:002022-11-28T16:04:43.553+08:00Note on sampling GGX Distribution of Visible Normals<b>Introduction</b><br />
<div>
After writing an AO demo in the last post, I started to write a progressive path tracer, but my progress was very slow due to the social unrest in the past few months (<a href="https://www.washingtonpost.com/graphics/2019/world/hong-kong-protests-excessive-force/">here are some related news about what has happened</a>). In the past weeks, the situation has calmed down a bit, so I continued writing my path tracer and started adding specular lighting. While implementing Eric Heitz's <a href="http://www.jcgt.org/published/0007/04/01/paper.pdf">"Sampling the GGX Distribution of Visible Normals"</a> technique, I was confused about why taking a random sample on a disk and then projecting it onto the hemisphere yields exactly the GGX distribution of visible normals (VNDF). I couldn't find a proof in the paper, so in this post I will try to verify that their PDFs are equal. (Originally, I planned to write this post after finishing my path tracer demo. But I worry that the situation here in Hong Kong will get worse again and I won't be able to write, so I decided to write it down first; I hope it won't get too boring with only math equations.)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5UwCR-kP9pb_AQUjwd6is6iGQhX19Yuiy5kL3Xn8qa3o8wphUQgCc1RVo_GHuRa7fqyUJQLM6zKTtvHp1j_dxUiZKRLnT8Cgldqw-aNbRF8fhqEvAD2HF1j7ZXI7XMhdBjFY2Stpb12Lp/s1600/spec.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5UwCR-kP9pb_AQUjwd6is6iGQhX19Yuiy5kL3Xn8qa3o8wphUQgCc1RVo_GHuRa7fqyUJQLM6zKTtvHp1j_dxUiZKRLnT8Cgldqw-aNbRF8fhqEvAD2HF1j7ZXI7XMhdBjFY2Stpb12Lp/s400/spec.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My work in progress path tracer, using GGX material only</td></tr>
</tbody></table>
<br />
<b>Quick summary of sampling the GGX VNDF technique</b><br />
For those who are not familiar with the GGX VNDF technique, I will briefly describe it. It is an importance sampling technique to sample a random normal vector from the GGX distribution. That normal vector is then used for generating a reflection vector, usually for the next reflected ray during path tracing.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGBEqXjEzWHPM_h9ddoHSMuovYAio_057eoxGEs2UvkPBtZVsodU3CqoDuSLYpzXbgYvftMLMhCNYIBxcJUEPmgv_P2Zt30HMlfNch6Wy3r9qJ7hMe5lygRhURmO8MwedpC8WH0_3Q5ZiO/s1600/ndf.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="90" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGBEqXjEzWHPM_h9ddoHSMuovYAio_057eoxGEs2UvkPBtZVsodU3CqoDuSLYpzXbgYvftMLMhCNYIBxcJUEPmgv_P2Zt30HMlfNch6Wy3r9qJ7hMe5lygRhURmO8MwedpC8WH0_3Q5ZiO/s320/ndf.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Traditional importance sampling scheme use D(N) to sample a normal vector</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUiLKyP1l4XqmuNdWC3tRRjKK7m8ylj0iFtPYLvtZw9xmb2ArbnOjC6ihAaCldGeIRQUnEdvzAWRKUqbnCeUUfvgYXnBUi_Obhx5g4HdsVkvXbcM6vLqkoBlhKRj0vYH-hUasbrD6FcK8t/s1600/vndf.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUiLKyP1l4XqmuNdWC3tRRjKK7m8ylj0iFtPYLvtZw9xmb2ArbnOjC6ihAaCldGeIRQUnEdvzAWRKUqbnCeUUfvgYXnBUi_Obhx5g4HdsVkvXbcM6vLqkoBlhKRj0vYH-hUasbrD6FcK8t/s320/vndf.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">VNDF technique use the visible normal to importance sample a vector, taking the view direction into account</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
Given a view vector to a GGX surface with arbitrary roughness, the steps to sample a normal vector are (an HLSL transcription of the paper's listing follows the illustration below):<br />
<ol>
<li>Transform the view vector to the GGX hemisphere configuration space (i.e. from arbitrary roughness to the roughness = 1 config) using the GGX stretch-invariance property.</li>
<li>Sample a random point on the projected disk along the transformed view direction.</li>
<li>Re-project the sampled point onto the hemisphere along the view direction. This will be our desired normal vector.</li>
<li>Transform the normal vector back to the original GGX roughness space from the hemisphere configuration.</li>
</ol>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiuS3TSHZFnrSmJ_6K9s84WPAt6ibL_t5TnU1hWEZvBFz38YfYK6v18exu_Ap9yPW7arYg27-BL9wlZjyNuya1UJV6My1wvBYggD1qiHrxCOeR9NfmYMjbAdDq98HKCw9s81jvAw3PpsK8/s1600/eric_pic.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiuS3TSHZFnrSmJ_6K9s84WPAt6ibL_t5TnU1hWEZvBFz38YfYK6v18exu_Ap9yPW7arYg27-BL9wlZjyNuya1UJV6My1wvBYggD1qiHrxCOeR9NfmYMjbAdDq98HKCw9s81jvAw3PpsK8/s640/eric_pic.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">VNDF sampling technique illustration from Eric Heitz's paper</td></tr>
</tbody></table>
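For reference, the paper condenses these steps into a short routine; below is my HLSL transcription of the listing given in the paper (Ve is the tangent-space view direction, U1/U2 are uniform random numbers), so treat it as a sketch:<br />
<blockquote class="tr_bq">
float3 SampleGGXVNDF(float3 Ve, float alpha_x, float alpha_y, float U1, float U2)<br />
{<br />
// step 1: transform the view direction to the hemisphere configuration<br />
float3 Vh = normalize(float3(alpha_x * Ve.x, alpha_y * Ve.y, Ve.z));<br />
// build an orthonormal basis around Vh<br />
float lensq = Vh.x * Vh.x + Vh.y * Vh.y;<br />
float3 T1 = lensq > 0.0 ? float3(-Vh.y, Vh.x, 0.0) * rsqrt(lensq) : float3(1.0, 0.0, 0.0);<br />
float3 T2 = cross(Vh, T1);<br />
// step 2: sample a point on the disk projected along the view direction<br />
float r = sqrt(U1);<br />
float phi = 2.0 * 3.14159265 * U2;<br />
float t1 = r * cos(phi);<br />
float t2 = r * sin(phi);<br />
float s = 0.5 * (1.0 + Vh.z);<br />
t2 = (1.0 - s) * sqrt(1.0 - t1 * t1) + s * t2;<br />
// step 3: re-project onto the hemisphere<br />
float3 Nh = t1 * T1 + t2 * T2 + sqrt(max(0.0, 1.0 - t1 * t1 - t2 * t2)) * Vh;<br />
// step 4: transform the normal back to the ellipsoid configuration<br />
return normalize(float3(alpha_x * Nh.x, alpha_y * Nh.y, max(0.0, Nh.z)));<br />
}<br />
</blockquote>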
My confusion mainly comes from steps 2 and 3, in the hemisphere configuration: why does this method of generating a normal vector match the GGX VNDF exactly...</div>
<div>
<br /></div>
<div>
<b>GGX NDF </b><b>definition</b></div>
<div>
Before digging deep into the problem, let's start with the definition of the GGX NDF. The paper states that <a href="http://www.jcgt.org/published/0007/04/01/paper.pdf">the GGX distribution uses only the upper part of the ellipsoid</a>, and when alpha/roughness equals 1, <a href="https://hal.archives-ouvertes.fr/hal-01509746/document">the GGX distribution is a uniform hemisphere</a>. According to the definition (with alpha = 1):<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEhqXyzfJ3K0hu6RMCfmY8tdRPvLHRImJiG1mvx5vcrWs6moBogsRaMyyZSA8P57Et_Bs9e1Aq5JWzkYDDHNmKfiNmFXMPZaHTikexq0NK8ZvfWJO_7qRu0MPyGNBC53SbaZ9yREC6ysIi/s1600/ndf_sim.png"><img border="0" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEhqXyzfJ3K0hu6RMCfmY8tdRPvLHRImJiG1mvx5vcrWs6moBogsRaMyyZSA8P57Et_Bs9e1Aq5JWzkYDDHNmKfiNmFXMPZaHTikexq0NK8ZvfWJO_7qRu0MPyGNBC53SbaZ9yREC6ysIi/s400/ndf_sim.png" width="400" /></a><br />
<br />
So its PDF will be:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTOJRlUrQHjAEMmbeMC0xQbAZkvHe5DAj9By4UzkbbSDpwpFGx-qlMD5xM-UnndeSSxFT073eiKbTWb4P4fw3pZ6JZPO-rJn45V4ZCwxHyUArmHaxDvki5gQoIxsbp2snSQBicOjBaYL_D/s1600/ndf_pdf.png"><img border="0" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTOJRlUrQHjAEMmbeMC0xQbAZkvHe5DAj9By4UzkbbSDpwpFGx-qlMD5xM-UnndeSSxFT073eiKbTWb4P4fw3pZ6JZPO-rJn45V4ZCwxHyUArmHaxDvki5gQoIxsbp2snSQBicOjBaYL_D/s200/ndf_pdf.png" width="200" /></a><br />
So, sampling a normal vector from the GGX distribution (with alpha = 1) equals sampling a vector with a cosine-weighted distribution.</div>
<div>
<br />
<b>GGX VNDF definition</b><br />
The definition of the VNDF depends on the shadowing function, and we are using the Smith shadowing function (with alpha = 1):<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXl4O8GTNQ2PZFZ910TulomQjN0eIi9n9C2v_9x4KdpMGcXmGmA3w375cUqJ74GKlN3C80hSRE1SKCoV0WO1ybIf5E4IhK-hUg_vC0xVM1qSAa14Ga-jZul7w8QUSuUHMguy2DlasxHjLS/s1600/g1.png"><img border="0" height="528" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXl4O8GTNQ2PZFZ910TulomQjN0eIi9n9C2v_9x4KdpMGcXmGmA3w375cUqJ74GKlN3C80hSRE1SKCoV0WO1ybIf5E4IhK-hUg_vC0xVM1qSAa14Ga-jZul7w8QUSuUHMguy2DlasxHjLS/s640/g1.png" width="640" /></a><br />
<br />
Therefore the VNDF equals to:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzvN1qtTwUTC5dOhFMSg46GzVoUNpROfFSQEE8q1pFnx49Wvonh1CsiLd_G74TKnhYOeFSCU9iJKf4V3d327L-zjhajeOmmsDfFu3_ibrDr_A5aoh7BkvvTaKaFKHpc67thJPzcaYpqyJP/s1600/vndf_sim.png"><img border="0" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzvN1qtTwUTC5dOhFMSg46GzVoUNpROfFSQEE8q1pFnx49Wvonh1CsiLd_G74TKnhYOeFSCU9iJKf4V3d327L-zjhajeOmmsDfFu3_ibrDr_A5aoh7BkvvTaKaFKHpc67thJPzcaYpqyJP/s320/vndf_sim.png" width="320" /></a><br />
<br />
<b>GGX VNDF specific case</b></div>
<div>
With both the GGX NDF and VNDF definitions, we can start investigating the problem. I decided to start with something simple first: the specific case where the view direction equals the surface normal (i.e. V = Z).<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJcVWYkXPDAwawDhYpg9LXBDSoASI6US9OqNxOD5ZyiGUA3wy8l6glxH1oRvLD6PvZl68lDAMzzf6jjDQOqTj9daXnS5rA8o8LE4y3XwHbi6Y-0FDkJ9x4-yAZm5d5Hxs7zm2Y58tiu2Tk/s1600/vndf_z.png"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJcVWYkXPDAwawDhYpg9LXBDSoASI6US9OqNxOD5ZyiGUA3wy8l6glxH1oRvLD6PvZl68lDAMzzf6jjDQOqTj9daXnS5rA8o8LE4y3XwHbi6Y-0FDkJ9x4-yAZm5d5Hxs7zm2Y58tiu2Tk/s320/vndf_z.png" width="320" /></a><br />
<br />
After simplification in this V=Z case, the PDF of Dz(N) is also cosine-weighted, which equals the PDF of the traditional GGX NDF sampling method.<br />
<br />
Now take a look at the sampling scheme in Eric Heitz's method. The method starts with uniform sampling of a unit disk, which has a <a href="http://www.pbr-book.org/3ed-2018/Monte_Carlo_Integration/2D_Sampling_with_Multidimensional_Transformations.html">PDF = 1/π</a>; the point is then projected onto the hemisphere along the view direction, which adds a cosine term to the PDF (i.e. Z.N/π) according to <a href="http://www.pbr-book.org/3ed-2018/Monte_Carlo_Integration/2D_Sampling_with_Multidimensional_Transformations.html">Malley's method</a> (where the cosine term comes from the Jacobian of the transform). Therefore, the VNDF and Eric Heitz's method are the same in this specific case: both have a cosine-weighted PDF.</div>
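As a side note, this V=Z special case is just the classic cosine-weighted hemisphere sampling. A minimal HLSL sketch of Malley's method (u1/u2 are uniform random numbers in [0, 1)):<br />
<blockquote class="tr_bq">
// sample a unit disk uniformly, then project up to the hemisphere around +Z<br />
float r = sqrt(u1);<br />
float phi = 2.0 * 3.14159265 * u2;<br />
float3 n = float3(r * cos(phi), r * sin(phi), sqrt(max(0.0, 1.0 - u1))); // 1 - r^2 = 1 - u1<br />
// n is cosine-weighted: PDF = n.z / π<br />
</blockquote>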
<div>
<br /></div>
<div>
<b>GGX VNDF general case</b></div>
<div>
To verify that Eric Heitz's sampling scheme matches the PDF of the GGX VNDF for all possible viewing directions, we need to calculate the PDF of his method and track how the PDF changes with each transformation. From the paper we have this vertical mapping:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggjRuc4E8f6nN0L8BD0kN0UhSxMV1zBj1anoR-CiJNpn8Sm4iWtU1KE11tl9fIoB562FsFJyoUDKO3WV4vqi2fQQL07qDtOXWa8FNNKf8Oj4JP5DYrx079vB2iEgJujpdtBI_lGa0JPXB2/s1600/eric_eqt.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggjRuc4E8f6nN0L8BD0kN0UhSxMV1zBj1anoR-CiJNpn8Sm4iWtU1KE11tl9fIoB562FsFJyoUDKO3WV4vqi2fQQL07qDtOXWa8FNNKf8Oj4JP5DYrx079vB2iEgJujpdtBI_lGa0JPXB2/s640/eric_eqt.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Transformation of randomly sampled point from Eric Heitz's paper</td></tr>
</tbody></table>
We know the PDF of sampling a unit disk is 1/π (i.e. <i>P(t1, t2)</i> = 1/π); we need to calculate <i>P(t1, t2')</i>:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/a/AVvXsEjpinAhey9Tqbnl6F11mg-L-l5xNE4rDVm_oSH05tHVQibGjtiquO-N3zL5hf3JRcqb64x8oghHafzQ3B8MKN6lZbnl41S1I-cPf8j9M5K6WDGHsjlDZAyIbITYTIx7r8S14Ng_6iCfz-NaC0WGQ5hx_2wgXV123ibEyCLViPES-dWfVZiQ5cRqORuBOQ"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/a/AVvXsEjpinAhey9Tqbnl6F11mg-L-l5xNE4rDVm_oSH05tHVQibGjtiquO-N3zL5hf3JRcqb64x8oghHafzQ3B8MKN6lZbnl41S1I-cPf8j9M5K6WDGHsjlDZAyIbITYTIx7r8S14Ng_6iCfz-NaC0WGQ5hx_2wgXV123ibEyCLViPES-dWfVZiQ5cRqORuBOQ=w249-h320" width="249" /></a><br />
</div><div></div><div></div>
<div>
<div><i><span style="font-size: x-small;">(Edited on 28/11/2022: Thanks for Brian Collins pointing out, dt<span style="font-size: xx-small;">2</span>'/dt<span style="font-size: xx-small;">1</span> was calculated incorrectly before)</span></i><br /></div><div>The next step of the algorithm is to re-project the disc to the hemisphere along the view direction, which produce our target importance sampled normal, so by Malley's method again (but this time along the view direction instead of surface normal), we can add a V.N Jacobian term to the above PDF <i>P(t1,t2')</i>:</div>
<div>
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-JFP4bJM2gZ0bd2tEnvC0AJONmmJI8_bokVXQgGPN3achZywjfm_gi3i1raYhDMzFq_8AdAl0jncvcqv7Q-_fgT_HkVLY4dLEwpM3dhaFkRAAh3f1HF6MtP7Qtv02Lfazg5Ks4FsTq-13/s1600/pdf_n.png"><img border="0" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-JFP4bJM2gZ0bd2tEnvC0AJONmmJI8_bokVXQgGPN3achZywjfm_gi3i1raYhDMzFq_8AdAl0jncvcqv7Q-_fgT_HkVLY4dLEwpM3dhaFkRAAh3f1HF6MtP7Qtv02Lfazg5Ks4FsTq-13/s400/pdf_n.png" width="400" /></a><br />
<br />
The resulting PDF equals the GGX VNDF definition exactly. This resolves my question of why Eric Heitz's sampling scheme is an exact sampling routine for the GGX VNDF.</div>
<div>
<br /></div>
<div>
<b>Conclusion</b></div>
<div>
This post describes my learning process for the paper <a href="http://www.jcgt.org/published/0007/04/01/paper.pdf">"Sampling the GGX Distribution of Visible Normals"</a> and resolves the part that confused me most: why "taking a random sample on a disk and then projecting it onto the hemisphere equals the GGX VNDF". If anybody knows a simpler proof of how these 2 equations are equal, or if you discover any mistake, please let me know in the comments. Thank you.<br />
<br />
<b><span style="font-size: x-small;">References</span></b><br />
<span style="font-size: x-small;">[1] <a href="http://www.jcgt.org/published/0007/04/01/paper.pdf">http://www.jcgt.org/published/0007/04/01/paper.pdf</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://hal.archives-ouvertes.fr/hal-01509746/document">https://hal.archives-ouvertes.fr/hal-01509746/document</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://agraphicsguy.wordpress.com/2015/11/01/sampling-microfacet-brdf/">https://agraphicsguy.wordpress.com/2015/11/01/sampling-microfacet-brdf/</a></span><br />
<span style="font-size: x-small;">[4] <a href="https://schuttejoe.github.io/post/ggximportancesamplingpart1/">https://schuttejoe.github.io/post/ggximportancesamplingpart1/</a></span><br />
<span style="font-size: x-small;">[5] <a href="https://schuttejoe.github.io/post/ggximportancesamplingpart2/">https://schuttejoe.github.io/post/ggximportancesamplingpart2/</a></span><br />
<span style="font-size: x-small;">[6] <a href="http://www.pbr-book.org/3ed-2018/Monte_Carlo_Integration/2D_Sampling_with_Multidimensional_Transformations.html">http://www.pbr-book.org/3ed-2018/Monte_Carlo_Integration/2D_Sampling_with_Multidimensional_Transformations.html</a></span><br />
<br /></div>
<div>
<br /></div>
</div>
<div>
<br /></div>
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-65178407024616576002019-09-30T23:54:00.001+08:002020-05-03T02:15:47.306+08:00DXR AO<b><span style="font-size: large;">Introduction</span></b><br />
<span style="font-size: x-small;">(Edit 3/5/2020: An updated version of the demo can be downloaded <a href="https://drive.google.com/file/d/1acRKuJyt5cxPxmx9UKfc9YH_wJubqMrt/view?usp=sharing"><b>here</b></a>, which support high DPI monitor and some bug fixes)</span><br />
It has been 2 months since my last post. For the past few months, the <a href="https://www.bbc.com/news/world-asia-china-49317695">situation here in Hong Kong</a> has been very bad. Our basic human rights are deteriorating. Absurd things happen, such as <a href="https://www.scmp.com/news/hong-kong/law-and-crime/article/3019524/least-10-injured-baton-wielding-mob-suspected-triad">suspected cooperation</a> <a href="https://www.youtube.com/watch?v=16CiwPChpr0&has_verified=1">between police</a> <a href="https://www.nytimes.com/2019/07/22/world/asia/hong-kong-protest-mob-attack-yuen-long.html">and triad</a>, <a href="https://www.hongkongfp.com/2019/08/12/hong-kong-police-shoot-projectiles-close-range-tai-koo-protester-suffers-ruptured-eye-tst/">as well as</a> <a href="https://www.hongkongfp.com/2019/09/20/broken-bones-internal-bleeding-hong-kong-police-used-reckless-indiscriminate-tactics-protests-says-amnesty/">the police</a> <a href="https://www.scmp.com/news/hong-kong/law-and-crime/article/3025241/chaos-hong-kongs-mtr-network-police-chase-protesters">brutality</a> (<a href="https://www.hongkongfp.com/2019/09/30/hong-kong-riot-police-target-journalists-sunday-unrest-reporter-shot-eye-projectile/">including shooting directly at the journalists</a>). I really don't know what can be done... Maybe you could spare a few minutes to sign <a href="https://petitions.whitehouse.gov/petition/please-impose-sanctions-hong-kong-police-until-all-have-committed-state-terrorism-crimes-are-brought-justice">some</a> <a href="https://petitions.whitehouse.gov/petition/global-magnitsky-act-pro-beijing-officials-hong-kong-executive-council">of</a> <a href="https://petitions.whitehouse.gov/petition/hongkongers-urge-congress-pass-protect-hong-kong-act">these</a> <a href="https://petitions.whitehouse.gov/petition/please-send-us-armed-force-hong-kong-rescue-citizens-massacre-carried-out-hong-kong-police-force">petitions</a>? Although such petitions may not be very useful, at least <a href="https://petitions.whitehouse.gov/petition/please-pass-bill-hong-kong-human-rights-and-democracy-act?fbclid=IwAR3bFTyr_xiIgHO9-9ICShIg1BQP1RgVbEEFBv7edIaNgOdNXUrCKah-dv4">after signing</a> some of them, the US Congress <a href="https://www.hongkongfp.com/2019/09/26/us-bill-punish-hong-kong-officials-strengthened-passes-congressional-committees-says-student-lobbying-leader/">is discussing</a> the Hong Kong Human Rights and Democracy Act now. I would sincerely appreciate your help. Thank you very much!<br />
<br />
Back to today's topic: after setting up my D3D12 rendering framework, I started to learn <a href="https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html">DirectX ray-tracing (DXR)</a>. I decided to write an ambient occlusion demo first because it is easier than writing a full path tracer, since I do not need to handle material information or lighting data. The demo can be downloaded from <a href="https://drive.google.com/file/d/1Qj1OJIK397ZRNxyF4BjKa6PvVO7Zt_OV/view?usp=sharing">here</a> (requires a DXR-compatible graphics card and driver, with Windows 10 build version 1809 or newer).<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghcH47ThtS8hkpMwIaPNG5RVpt8JjWh9pKtrBfqbbS31C9i8FU9ZUzabmZaG3ONjdRgxwpp3OtRG30pGKRfBzBx1u9PZm39xW_F0Go6h9Thg5V1n99rLC21VxqUDQ8wTn1kg8GpuQ37POp/s1600/ao.png" imageanchor="1"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghcH47ThtS8hkpMwIaPNG5RVpt8JjWh9pKtrBfqbbS31C9i8FU9ZUzabmZaG3ONjdRgxwpp3OtRG30pGKRfBzBx1u9PZm39xW_F0Go6h9Thg5V1n99rLC21VxqUDQ8wTn1kg8GpuQ37POp/s640/ao.png" width="640" /></a><br />
<br />
<b><span style="font-size: large;">Rendering pipeline</span></b><br />
In this demo, a G-buffer with normal and depth data is rendered first. Then a velocity buffer is generated using the current and previous frame camera transforms, stored in RG16Snorm format. Rays are then traced from the world position reconstructed from the depth buffer, using a cosine-weighted distribution. To avoid ray-geometry self-intersection, the ray origin is shifted towards the camera a bit. After that, a temporal and a spatial filter are applied to smooth out the noisy AO image, and an optional bilateral blur pass can be applied for a final clean-up.<br />
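The occlusion rays themselves need no material evaluation, so the ray generation shader stays small. Below is a rough HLSL sketch of the ray setup (the names, the payload struct, and the 0.01 offset are my assumptions, not the demo's exact code):<br />
<blockquote class="tr_bq">
RayDesc ray;<br />
ray.Origin = worldPos + normalize(cameraPos - worldPos) * 0.01; // shift towards the camera<br />
ray.Direction = sampleDir; // cosine-weighted direction around the G-buffer normal<br />
ray.TMin = 0.0;<br />
ray.TMax = aoMaxDistance;<br />
AOPayload payload; // user-defined payload struct holding a single float 'hit'<br />
payload.hit = 1.0;<br />
// occlusion-only query: accept the first hit and skip the closest-hit shader<br />
TraceRay(sceneTLAS, RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH | RAY_FLAG_SKIP_CLOSEST_HIT_SHADER, 0xFF, 0, 1, 0, ray, payload);<br />
float visibility = 1.0 - payload.hit; // the miss shader sets payload.hit = 0<br />
</blockquote>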
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiINkHDXU1ckAYxpx37_vvQMIBtxb46Zg29qpq9j7zwXljiuU1oMFIRTmp1xXDUZwuilE7rVka1a2sOsd1XKuGnn2ciTkuaCVskXAwGMNYuu4HgTHbIequq3kUpH9Bd1Kpx9Iy5Y9Q95qmp/s1600/passes.png" imageanchor="1"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiINkHDXU1ckAYxpx37_vvQMIBtxb46Zg29qpq9j7zwXljiuU1oMFIRTmp1xXDUZwuilE7rVka1a2sOsd1XKuGnn2ciTkuaCVskXAwGMNYuu4HgTHbIequq3kUpH9Bd1Kpx9Iy5Y9Q95qmp/s400/passes.png" width="400" /></a><br />
<br />
<span style="font-size: large;"><b>Temporal Filter</b></span><br />
With the noisy image generated from the ray tracing pass, we can reuse the previous frame's ray-traced data to smooth out the image. In the demo, the velocity buffer is used to get the pixel location in the previous frame (with an additional depth check between the current frame depth value and the re-projected previous frame depth value). As we are calculating ambient occlusion using Monte Carlo integration:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoOZPc2Hq5CuxOKhktwcOOe_bUzOwpMsZhgGu8XtYUwvIdy8odXIU-1thTatGfYGRkN07UfjcdUxITypBs6JaQcR8rtLB-pWqAgoRwxeOrFui7G19ymJJRimu3exVwjUaOaqCKnsbKOJJ8/s1600/AO_formula.png" imageanchor="1"><img border="0" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoOZPc2Hq5CuxOKhktwcOOe_bUzOwpMsZhgGu8XtYUwvIdy8odXIU-1thTatGfYGRkN07UfjcdUxITypBs6JaQcR8rtLB-pWqAgoRwxeOrFui7G19ymJJRimu3exVwjUaOaqCKnsbKOJJ8/s320/AO_formula.png" width="320" /></a><br />
<br />
We can split the Monte Carlo integration over multiple frames and store the AO result in an RG16Unorm texture, where the red channel stores the accumulated AO result and the green channel stores the total sample count N (the sample count is clamped to 255 to avoid overflow). So after a new frame is rendered, we can accumulate the AO Monte Carlo integration with the following equation:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnVQoYKuEftiE7tE5em68J_guQEXcgOBbQpMi7xSuYhkqdZOg4PKOaVex4dB7GsY931Q_rjKMyDF0y4dvRwnqPN3_k-PEn6Ul80eKBatvI14L1bbKrHyVRbudq7YdKe9OKFTmMADkpBc-i/s1600/AO_formula_accum.png" imageanchor="1"><img border="0" height="93" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnVQoYKuEftiE7tE5em68J_guQEXcgOBbQpMi7xSuYhkqdZOg4PKOaVex4dB7GsY931Q_rjKMyDF0y4dvRwnqPN3_k-PEn6Ul80eKBatvI14L1bbKrHyVRbudq7YdKe9OKFTmMADkpBc-i/s400/AO_formula_accum.png" width="400" /></a><br />
<br />
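In HLSL, this running average could look like the sketch below (the RG16Unorm packing follows the description above; the texture and variable names are assumptions):<br />
<blockquote class="tr_bq">
float2 history = historyTex[prevPx].rg; // r: accumulated AO, g: sample count / 255<br />
float N = history.g * 255.0;<br />
float ao = (history.r * N + newAO) / (N + 1.0); // fold one more sample into the average<br />
aoOutput[px] = float2(ao, min(N + 1.0, 255.0) / 255.0); // clamp the count to avoid overflow<br />
</blockquote>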
We also reduce the sample count by the delta depth difference between the current and previous frame depth buffer values (i.e. when the camera zooms out/in) to "fade out" the accumulated history faster and reduce ghosting artifacts.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixoTbgvyhYJhBzwb6x3T25iaLXP7VeNYfXQyF5-YIYUwAzzUWhoU3Lxsj1L6swfU_kA2wQmvprVlvzx12oqjV6NJClok5kNSlgAwsRvPZIEm8gBtdV1Hrd5YfolvrCMpNU_jMahIoGpnIX/s1600/ao_noise.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixoTbgvyhYJhBzwb6x3T25iaLXP7VeNYfXQyF5-YIYUwAzzUWhoU3Lxsj1L6swfU_kA2wQmvprVlvzx12oqjV6NJClok5kNSlgAwsRvPZIEm8gBtdV1Hrd5YfolvrCMpNU_jMahIoGpnIX/s320/ao_noise.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">AO image traced at 1 ray per pixel</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdRFju5G6_qBycK0_lkq3rxiYKWoeIB6rnEG49FkpcG8M8KmhAeLE_LjqdM8jWLXSXRdZoy0KDQp_pMuKhCczovug5CqjKlZYAjZhEkl_GOonXw5PEPQ_GD2Bbf1w4VmDEvLcQeohh2J9C/s1600/ao_temporal.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdRFju5G6_qBycK0_lkq3rxiYKWoeIB6rnEG49FkpcG8M8KmhAeLE_LjqdM8jWLXSXRdZoy0KDQp_pMuKhCczovug5CqjKlZYAjZhEkl_GOonXw5PEPQ_GD2Bbf1w4VmDEvLcQeohh2J9C/s320/ao_temporal.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">AO image with accumulated samples over multiple frames</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
But this re-projection temporal filter has a shortcoming: it fails very often at geometry edges (especially when done at half resolution). So in the demo, when re-projection fails, it shifts 1 pixel and performs the re-projection again to accumulate more samples.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc2qLEOJacBCRJeq7rDqvM9U5ggzr8AzUr8MneA9M_8ThSUS_1Q8w30b54oOnFT_o-AQlGgjqFyIHDZTGB88VlM_GiMN-T7bfT8QYytZolVmHwdP3r-MjQMstv2MLZnkZBSAikbTa3OuEg/s1600/edge_before.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc2qLEOJacBCRJeq7rDqvM9U5ggzr8AzUr8MneA9M_8ThSUS_1Q8w30b54oOnFT_o-AQlGgjqFyIHDZTGB88VlM_GiMN-T7bfT8QYytZolVmHwdP3r-MjQMstv2MLZnkZBSAikbTa3OuEg/s320/edge_before.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Many edge pixels failed the re-projection test</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKPUtlPk13GCc_brG8w9qR16iPZ79UXVLqm3Z-pYKKdqcYjA22tnTyImJeSu7YIUxtdo9_zMBZvcQq3RM9E50CEj1lggJTO69WTt0JAWV7zGBrf3QqqyMRBHCerWUHplzcO8F1f9LAPviC/s1600/edge_after.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKPUtlPk13GCc_brG8w9qR16iPZ79UXVLqm3Z-pYKKdqcYjA22tnTyImJeSu7YIUxtdo9_zMBZvcQq3RM9E50CEj1lggJTO69WTt0JAWV7zGBrf3QqqyMRBHCerWUHplzcO8F1f9LAPviC/s320/edge_after.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">With 1 pixel shifted, many edge pixels can be re-projected</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
As the result is biased, I also reduce the sample count by a factor of 0.75 to make the correct ray-traced result "blend in" faster.<br />
<br />
<span style="font-size: large;"><b>Spatial Filter</b></span><br />
To increase the sample count for the Monte Carlo integration, we can reuse the ray-traced data in neighboring pixels. We search a 5x5 grid and reuse a neighbor's data if it lies on the same surface, by comparing delta depth values (i.e. ddx and ddy generated from the depth buffer). As the delta depth value is re-generated from the depth buffer, some artifacts may be seen on triangle edges.<br />
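A sketch of that same-surface test inside the 5x5 loop (the depth-plane prediction follows the description above, while the tolerance value is an assumption):<br />
<blockquote class="tr_bq">
// predict the neighbor's depth from the center pixel's depth plane<br />
float predictedDepth = centerDepth + dot(float2(dx, dy), float2(ddxDepth, ddyDepth));<br />
if (abs(neighborDepth - predictedDepth) < 0.01 * centerDepth) // same-surface test<br />
{<br />
sumAO += neighborAO * neighborCount; // weight by the neighbor's sample count<br />
sumCount += neighborCount;<br />
}<br />
</blockquote>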
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_pGze6QUFnObd0PK_a_jKLIbFpefXouEDJ-ggqAvFrYcIXoC2pqZIclf8F_Ed0JWz2UsUMrmSCfSDFWMvMUbY4KIhuCPS224Gm1xGWl7m6sRq5hCfufGlnY6dtp9xUKsDrFhfqvXSaOeY/s1600/ao_spatial.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_pGze6QUFnObd0PK_a_jKLIbFpefXouEDJ-ggqAvFrYcIXoC2pqZIclf8F_Ed0JWz2UsUMrmSCfSDFWMvMUbY4KIhuCPS224Gm1xGWl7m6sRq5hCfufGlnY6dtp9xUKsDrFhfqvXSaOeY/s320/ao_spatial.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">noisy AO image applied with a spatial filter</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7e5qcCLedElL6BxnB100Qwi-pRU0581KZy-rbUW0gAl2-19Ay1Nhk_wYoKCGevFmOqWuSbRc74NfATDgqkf3JeZI4JMsjhDXKw2LeMEEgQkbSiER3Qsq1F3JpsslNz7l0dkcbAC4Ve3dG/s1600/tri_edge_artifact.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7e5qcCLedElL6BxnB100Qwi-pRU0581KZy-rbUW0gAl2-19Ay1Nhk_wYoKCGevFmOqWuSbRc74NfATDgqkf3JeZI4JMsjhDXKw2LeMEEgQkbSiER3Qsq1F3JpsslNz7l0dkcbAC4Ve3dG/s320/tri_edge_artifact.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">artifact shown at the triangle edge by re-constructed delta depth</td></tr>
</tbody></table>
</td>
</tr>
<tr>
</tr>
</tbody></table>
To save some performance, besides using half-resolution rendering, we can also choose to interleave the ray casts so that only 1 in every 4 pixels is traced, and ray cast the remaining pixels in the next few frames.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheSOphDT9Q_Mj7vmfwnpKkMktYYhOINQsvrGc8ehqcyePAypcU4C6P0jZ3ioab7AXvPafxYNNgKF-fl_zSD8-5qIfVIcwtc93fxX1WlrpDG94GfjXetS_-rEgWQ1XG5yC8uOtNhUSp8rai/s1600/interleaved.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheSOphDT9Q_Mj7vmfwnpKkMktYYhOINQsvrGc8ehqcyePAypcU4C6P0jZ3ioab7AXvPafxYNNgKF-fl_zSD8-5qIfVIcwtc93fxX1WlrpDG94GfjXetS_-rEgWQ1XG5yC8uOtNhUSp8rai/s400/interleaved.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rays are traced only at the red pixels<br />
to save performance</td></tr>
</tbody></table>
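The interleave pattern can be expressed by rotating one active pixel within each 2x2 tile every frame; a small sketch (the tile size and rotation order are assumptions):<br />
<blockquote class="tr_bq">
uint2 cell = px & 1; // position within the 2x2 tile<br />
uint cellIndex = cell.y * 2 + cell.x;<br />
bool traceThisFrame = (cellIndex == (frameIndex & 3)); // 1 ray per 4 pixels per frame<br />
</blockquote>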
For those pixels without any ray-traced data during interleaved rendering, we use the spatial filter to fill in the missing data. The same-surface depth check in the spatial filter can be bypassed when the sample count (stored in the green channel during the temporal filter) is low, because it is better to have some "wrong" neighbor data than no data for the pixel. This also helps to remove the edge artifact shown before.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8eap5EBT-JS0fQ6V7h8b0ggHFES9wZDdVYnCT4zoE5hNySzxMJDkX4pamxBLHxzofRD2hWnwZ85FFY2f0ysN7QJtch4HV0R4tBYhKwsBGb0sWuuoxcdCCJNdLt3Sgt5Kgx0h8AwIfJVt0/s1600/spatial_interleaved_before.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8eap5EBT-JS0fQ6V7h8b0ggHFES9wZDdVYnCT4zoE5hNySzxMJDkX4pamxBLHxzofRD2hWnwZ85FFY2f0ysN7QJtch4HV0R4tBYhKwsBGb0sWuuoxcdCCJNdLt3Sgt5Kgx0h8AwIfJVt0/s320/spatial_interleaved_before.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rays are traced at interleaved pattern, <br />
leaving many 'holes' in the image</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1wPu-N-B42EfbcYbliJLz9z4_wYlU4ASTwh2jYod6iRC6KSqI9TKhcyvrSrksDM-xvjIQEcqIJsFEP4O6pss22zKaB6I8Dwy1Rys3i-nBh7f7U42RUn7w8l99mk91itOnCtRcJmFPsS3p/s1600/spatial_interleaved_after.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1wPu-N-B42EfbcYbliJLz9z4_wYlU4ASTwh2jYod6iRC6KSqI9TKhcyvrSrksDM-xvjIQEcqIJsFEP4O6pss22zKaB6I8Dwy1Rys3i-nBh7f7U42RUn7w8l99mk91itOnCtRcJmFPsS3p/s320/spatial_interleaved_after.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Spatial filter will fill in those 'holes' <br />
during interleaved rendering</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
Also, when the ray casts are interleaved between pixels, we need to pay attention to the temporal filter too: we may re-project to a previous frame pixel which has no sample data. In this case, we snap the re-projected UV to the pixel that cast an interleaved ray in the previous frame.<br />
<br />
<span style="font-size: large;"><b>Bilateral Blur</b></span><br />
To clean up the remaining noise from the temporal and spatial filters, a bilateral blur is applied; we can get a wider blur by using the <a href="https://jo.dreggn.org/home/2010_atrous.pdf">edge-aware À-Trous algorithm</a>. The blur radius is adjusted according to the sample count (stored in the green channel in the temporal filter), so when we have already cast many ray samples, we can reduce the blur radius to get a sharper image.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHqFqSdD-yFGpS0kd0SvE0AB89Kyvq4BTNoTtEs847iZW9fytw4gJ3XZTkFeqyWKDPU2YYHThRN8UsdEULEErkDhsBoes8Pb-r60fcW98yBaJ1MveT2tnnf8NhiEcQHkMPiBBTUUeXK2pQ/s1600/ao_blur.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHqFqSdD-yFGpS0kd0SvE0AB89Kyvq4BTNoTtEs847iZW9fytw4gJ3XZTkFeqyWKDPU2YYHThRN8UsdEULEErkDhsBoes8Pb-r60fcW98yBaJ1MveT2tnnf8NhiEcQHkMPiBBTUUeXK2pQ/s400/ao_blur.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Applying an additional bilateral blur to smooth out remaining noise</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Random Ray Direction</span></b><br />
When choosing the random ray cast directions, we want the chosen directions to have a more significant effect. Since we have a spatial filter that reuses neighboring pixels' data, we can try to cast rays such that the angles between the ray directions of neighboring pixels are as large as possible, while covering as much of the hemisphere as possible.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5LVL5Qf5iGWbcPDNZ7QJKy8E21rNdFFCvNEbVOsuf_LUZYqag2UT9gv44r6FWhcqW9bcnxijOP-VqBXK5Fpwrl_WprmagIMU0EawMtpYjyM1UJdpq5oAnsDemGRRopgPjKQL_gDmYSUa8/s1600/ray_dir.png" imageanchor="1"><img border="0" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5LVL5Qf5iGWbcPDNZ7QJKy8E21rNdFFCvNEbVOsuf_LUZYqag2UT9gv44r6FWhcqW9bcnxijOP-VqBXK5Fpwrl_WprmagIMU0EawMtpYjyM1UJdpq5oAnsDemGRRopgPjKQL_gDmYSUa8/s320/ray_dir.png" width="320" /></a><br />
<br />
It looks like we can use some kind of <a href="http://momentsingraphics.de/BlueNoise.html">blue noise texture</a> so that the ray directions are well distributed. Let's take a look at how the cosine-weighted random ray direction is generated:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfd1-qare77nEBxucXMATedb9f3_5tAdP1fZ318vQvXQxVK61qN7D2cT98v_bBGlzkOvQNmkCualRLChQoutiuNAYUPhlEzhp5WUbiTFI2jZ41ci4GVNkKrQv2TiYTWuO0_mE48bRUD6BI/s1600/ray_dir_formula.png" imageanchor="1"><img border="0" height="88" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfd1-qare77nEBxucXMATedb9f3_5tAdP1fZ318vQvXQxVK61qN7D2cT98v_bBGlzkOvQNmkCualRLChQoutiuNAYUPhlEzhp5WUbiTFI2jZ41ci4GVNkKrQv2TiYTWuO0_mE48bRUD6BI/s400/ray_dir_formula.png" width="400" /></a><br />
<br />
From the above equation, the random variable ϕ directly corresponds to the random ray direction on the tangent plane, with a linear relationship between the angle ϕ and the random variable ξ<span style="font-size: xx-small;">2</span>. Since we generate random numbers using a <a href="http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/">wang hash</a>, which is white noise, maybe we can stratify the random range and use blue noise to pick the desired stratum, turning the result into a blue-noise-like pattern. For example, given a random number in [0, 1), we may stratify it into 4 ranges: [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1). Then we use the screen space pixel coordinates to sample a tileable blue noise texture, and according to the value of the blue noise, we scale the white noise random number into 1 of the 4 stratified ranges. Below is some sample code of how the stratification is done:<br />
<blockquote class="tr_bq">
<span style="font-family: inherit;">int BLUE_NOISE_TEX_SIZE= 64;</span><br />
<span style="font-family: inherit;">int STRATIFIED_SIZE= 16;</span><br />
<span style="font-family: inherit;">float4 noise= blueNoiseTex[pxPos % BLUE_NOISE_TEX_SIZE];</span><br />
<span style="font-family: inherit;">uint2 noise_quantized= noise.xy * (255.0 * STRATIFIED_SIZE / 256.0);</span><br />
<span style="font-family: inherit;">float2 r= wang_hash(pxPos); // random white noise in range [0, 1)</span><br />
<span style="font-family: inherit;">r = mad(r, 1.0/STRATIFIED_SIZE, noise_quantized * (1.0/STRATIFIED_SIZE));</span></blockquote>
With the blue-noise-adjusted ray directions, the ray-traced AO image looks less noisy visually:<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0xzpG8VJjWqEZkS_8b2CNpbHoiSYijPm6juSWtebKH5H8A6uK_VOPff-DW1ArVML8y5MAu6z1bw4kNszbuUhBb0mcnHpROw4OJuCFHMOOTLQrI3Y52M9Z5VAlkaF8JkBiVoyIuxYbsSJ0/s1600/noise_white.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0xzpG8VJjWqEZkS_8b2CNpbHoiSYijPm6juSWtebKH5H8A6uK_VOPff-DW1ArVML8y5MAu6z1bw4kNszbuUhBb0mcnHpROw4OJuCFHMOOTLQrI3Y52M9Z5VAlkaF8JkBiVoyIuxYbsSJ0/s320/noise_white.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rays are traced using white noise</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0C6yP1EbtGh7A-9WbIDFdHy_WZpJhNJ_URX5iqBJdrmIFv9lxZPjfXS4AGUFgWzEgYorw8VTIgZR-yigrB_csox6TVF2zyJjW1MOtlfAsH54q_VRls_WBVeNl72KxbSADNj5IFjrLEhdm/s1600/noise_blue.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0C6yP1EbtGh7A-9WbIDFdHy_WZpJhNJ_URX5iqBJdrmIFv9lxZPjfXS4AGUFgWzEgYorw8VTIgZR-yigrB_csox6TVF2zyJjW1MOtlfAsH54q_VRls_WBVeNl72KxbSADNj5IFjrLEhdm/s320/noise_blue.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rays are traced using blue noise</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggqWSDwSBfBm1zKJrUO3n6aqMvv9wb_hX5F5_XOV9rUzXN2skAURv862x4m0jTSZ_xALSpFLo5lLqAkcsYEmKW9sbI0KOaILhPIEYPkPS4ZVqSbPUwu7cLsx1rNL5kmjGFmzOD6ufrnXz3/s1600/noise_white_blur.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggqWSDwSBfBm1zKJrUO3n6aqMvv9wb_hX5F5_XOV9rUzXN2skAURv862x4m0jTSZ_xALSpFLo5lLqAkcsYEmKW9sbI0KOaILhPIEYPkPS4ZVqSbPUwu7cLsx1rNL5kmjGFmzOD6ufrnXz3/s320/noise_white_blur.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blurred white noise AO image</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIf8iYjasidUG9FSKtMi5w09BjVrRoYWXAwYVPEeFW_mV_03HS6qEYqrRiemV6rMB2Zr4a072YhSMidkIHWepimDontPIBCklRRrwKtHmIqIe1Tp9oj-kXFV15wq3RbabX5b2pUw4aQhyphenhyphen/s1600/noise_blue_blur.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIf8iYjasidUG9FSKtMi5w09BjVrRoYWXAwYVPEeFW_mV_03HS6qEYqrRiemV6rMB2Zr4a072YhSMidkIHWepimDontPIBCklRRrwKtHmIqIe1Tp9oj-kXFV15wq3RbabX5b2pUw4aQhyphenhyphen/s320/noise_blue_blur.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blurred blue noise AO image</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Ray Binning</span></b><br />
In the demo, ray binning is <a href="http://advances.realtimerendering.com/s2019/Benyoub-DXR%20Ray%20tracing-%20SIGGRAPH2019-final.pdf">also</a> <a href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91023-it-just-works-ray-traced-reflections-in-battlefield-v.pdf">implemented</a>, but the performance improvement is not significant. Ray binning only shows a large performance gain when the ray tracing distance is large (e.g. > 10m) and both half-resolution and interleaved rendering are turned off. I have only run the demo on my GTX 1060; the situation may be different on an RTX graphics card (so this is something to investigate in the future). Also, the demo may show a slight difference when toggling ray binning on/off, due to the precision of the RGBA16Float format used to store the ray directions (the difference vanishes after accumulating more samples over multiple frames with the temporal filter).<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
In this post, I have described how DXR is used to compute ray-traced AO in real time, using a combination of temporal and spatial filters. Those filters are important to increase the total sample count of the Monte Carlo integration and to get a noise-free, stable image. The demo can be downloaded from <a href="https://drive.google.com/file/d/1Qj1OJIK397ZRNxyF4BjKa6PvVO7Zt_OV/view?usp=sharing">here</a>. There is still plenty of stuff to improve, such as having a <a href="https://cg.ivd.kit.edu/publications/2017/svgf/svgf_preprint.pdf">better filter</a>: currently, when the AO distance is large and both half-resolution and interleaved rendering are turned on (i.e. 1 ray per 16 pixels), the image is too noisy and not temporally stable during camera movement. Maybe I will need to improve this when writing a path tracer in the future.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] DirectX Raytracing (DXR) Functional Spec <a href="https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html">https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html</a></span><br />
<span style="font-size: x-small;">[2]<span style="font-size: x-small;"><span style="font-family: inherit;"> <span style="left: 140.473px; top: 205.127px; transform: scaleX(0.981098);">Edge-Avoiding À-Trous Wavelet Transform for fast Global</span><span style="left: 365.062px; top: 238.337px; transform: scaleX(1.04612);">Illumination Filtering <a href="https://jo.dreggn.org/home/2010_atrous.pdf">https://jo.dreggn.org/home/2010_atrous.pdf</a></span></span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: inherit;"><span style="left: 365.062px; top: 238.337px; transform: scaleX(1.04612);"><span style="font-size: x-small;"><span style="font-family: inherit;"><span style="font-size: x-small; left: 365.062px; top: 238.337px; transform: scalex(1.04612);">[3] Free blue noise textures <a href="http://momentsingraphics.de/BlueNoise.html">http://momentsingraphics.de/BlueNoise.html</a></span></span></span></span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: inherit;"><span style="font-size: x-small; left: 365.062px; top: 238.337px; transform: scalex(1.04612);">[4] Quick And Easy GPU Random Numbers In D3D11 <a href="http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/">http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/</a></span></span></span><br />
<span style="font-size: x-small;"><span style="font-size: x-small;"><span style="font-family: inherit;"><span style="left: 365.062px; top: 238.337px; transform: scaleX(1.04612);">[5] Leveraging Real-Time Ray Tracing to build a Hybrid Game Engine <a href="http://advances.realtimerendering.com/s2019/Benyoub-DXR%20Ray%20tracing-%20SIGGRAPH2019-final.pdf">http://advances.realtimerendering.com/s2019/Benyoub-DXR%20Ray%20tracing-%20SIGGRAPH2019-final.pdf</a></span></span></span></span><br />
<span style="font-size: x-small;">[6] ”It Just Works”: Ray-Traced Reflections in 'Battlefield V' <a href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91023-it-just-works-ray-traced-reflections-in-battlefield-v.pdf">https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91023-it-just-works-ray-traced-reflections-in-battlefield-v.pdf</a></span>Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-72430125059732412982019-07-14T13:53:00.000+08:002019-07-14T13:53:02.509+08:00Reflection and Serialization<b><span style="font-size: large;">Introduction</span></b><br />
Reflection and serialization are a convenient way to save/load data. After reading <a href="https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description">"The Future of Scene Description on 'God of War'"</a>, I decided to try to write something like the "Compile-time Type Information" described in the presentation (but a much simpler one with fewer features). All I need is something to save/load C style structs (something like the D3D DESC structures, e.g. <a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_shader_resource_view_desc">D3D12_SHADER_RESOURCE_VIEW_DESC</a>) in my toy engine.<br />
<br />
<b><span style="font-size: large;">Reflection</span></b><br />
A reflection system is needed to describe how structs are defined before writing a serialization system. <a href="https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/">This site</a> has much information about this topic. I use a similar approach to describe the C struct data with some macros. We define the following 2 data types to describe all possible structs that need to be reflected/serialized in my toy engine (with some variables omitted for easier understanding).<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPcwmrFTwJ_SOQObAGrS10xeHwIOhAxpkmRuvDenOYrkTf75-U-qjp14GW5fbXguOPw9wBJ6qA-GAKSruNoF6Ywo91OW8f2i4XmQXG8CfEYzoc-DLznbDgyJshk7x1hEJaFgL-MyEXpbRP/s1600/TypeInfo.png" imageanchor="1"><img border="0" height="100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPcwmrFTwJ_SOQObAGrS10xeHwIOhAxpkmRuvDenOYrkTf75-U-qjp14GW5fbXguOPw9wBJ6qA-GAKSruNoF6Ywo91OW8f2i4XmQXG8CfEYzoc-DLznbDgyJshk7x1hEJaFgL-MyEXpbRP/s640/TypeInfo.png" width="640" /></a><br />
<br />
As you can guess from their names, TypeInfo is used to describe the C struct that needs to be reflected/serialized, and TypeInfoMember is responsible for describing the member variables inside the struct. We can use some macro tricks to reflect a struct (more can be found in the <a href="https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/">reference</a>):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwRYve4PKNlcEewNJNhGeBEDAJnDGyeZFFEOIlH3nFp__Hus53BIwUdORn0j-2FpJMuEfAYiCrDytTkqx35Lwor6Y1ESZvkyX3yYLlI8pjQKW37CHf_lDgKwakL9haJFqVi4Zgp6Bu0QZo/s1600/reflect_macro.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwRYve4PKNlcEewNJNhGeBEDAJnDGyeZFFEOIlH3nFp__Hus53BIwUdORn0j-2FpJMuEfAYiCrDytTkqx35Lwor6Y1ESZvkyX3yYLlI8pjQKW37CHf_lDgKwakL9haJFqVi4Zgp6Bu0QZo/s400/reflect_macro.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">struct reflection example</td></tr>
</tbody></table>
The above example reflects 3 variables inside struct vec3: <i>x</i>, <i>y</i>, <i>z</i>. The trick behind those macros is to use <a href="https://en.cppreference.com/w/cpp/language/sizeof">sizeof()</a>, <a href="https://en.cppreference.com/w/cpp/language/alignof">alignof()</a>, <a href="https://en.cppreference.com/w/cpp/types/offsetof">offsetof()</a> and the <a href="https://en.cppreference.com/w/cpp/language/type_alias">using keyword</a>. The sample implementation can be found below:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNGHrQ9y5EZx4jB33N7TU_ct36S7IdHZM5bqe4hL2Fl8OLPFWKQxPFqwyp1ILIAX2SbYc0Z7z85xMNHK55IQkgxgGpb9qWB1VTRNfdn3MhAsSZ4fqGh_s1V2aj_IZePWTgD_Ct866QluLC/s1600/reflect_macro_def.png" imageanchor="1"><img border="0" height="204" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNGHrQ9y5EZx4jB33N7TU_ct36S7IdHZM5bqe4hL2Fl8OLPFWKQxPFqwyp1ILIAX2SbYc0Z7z85xMNHK55IQkgxgGpb9qWB1VTRNfdn3MhAsSZ4fqGh_s1V2aj_IZePWTgD_Ct866QluLC/s640/reflect_macro_def.png" width="640" /></a><br />
<br />
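Since the implementation above is only shown as an image, here is a minimal text sketch of the same idea. The field and macro names are my own reconstruction, not the exact engine code (the engine version also uses the using keyword so the macros don't need the type name passed explicitly):<br />
<pre>
#include &lt;cstddef&gt; // offsetof

struct TypeInfoMember
{
    const char* mName;   // member variable name
    size_t      mOffset; // byte offset from the start of the struct
    size_t      mSize;   // sizeof the member
};

struct TypeInfo
{
    const char*           mName;      // struct name
    size_t                mSize;      // sizeof the struct
    size_t                mAlignment; // alignof the struct
    const TypeInfoMember* mMembers;
    size_t                mMemberCount;
};

// Reflect a member by stringizing its name and using offsetof()/sizeof().
#define REFLECT_MEMBER(type, member) \
    { #member, offsetof(type, member), sizeof(type::member) }

struct vec3 { float x, y, z; };

static const TypeInfoMember gVec3Members[] =
{
    REFLECT_MEMBER(vec3, x),
    REFLECT_MEMBER(vec3, y),
    REFLECT_MEMBER(vec3, z),
};

static const TypeInfo gVec3TypeInfo =
{
    "vec3", sizeof(vec3), alignof(vec3),
    gVec3Members, sizeof(gVec3Members) / sizeof(gVec3Members[0])
};
</pre>
<br />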
This approach has one disadvantage: we cannot use <a href="https://en.cppreference.com/w/cpp/language/bit_field">bit fields</a> to specify how many bits are used in a variable (offsetof() does not work on bit fields). Also, bit field order seems to be <a href="https://stackoverflow.com/questions/1490092/c-c-force-bit-field-order-and-alignment">compiler dependent</a>. So I just don't use bit fields in the structs that need to be reflected.<br />
<br />
It also has another disadvantage: it is error-prone to reflect each variable manually. So I have written a C struct header parser (using Flex & Bison) to generate the reflection source code. For those C struct files that need auto-generated reflection data, instead of naming the source file with the extension .h, we name it with another file extension (e.g. .hds) and use a Visual Studio <a href="https://simonstechblog.blogspot.com/2019/06/msbuild-custom-build-tools-notes.html">custom MSBuild file</a> to execute my header parser. To make Visual Studio syntax highlight this custom file type, we need to associate the file extension with the C/C++ syntax by navigating to<br />
<blockquote class="tr_bq">
"Tools" -> "Options" -> "Text Editor" -> "File Extension"</blockquote>
and add the appropriate association:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUOQMOejgtNaGGXkDn5cxi5VhYlACFEW2l50WTA2rTvIeH1iWHFVAF8ROaSXU2OJzJ3z6n7L5gvpr_mT8zzLgpe8NKyn6Ma4vEka4VJmzHXFSzQ_s2CfoE3gwbSs1lvwT-Aj0BPi_ydrAU/s1600/file_extension.png" imageanchor="1"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUOQMOejgtNaGGXkDn5cxi5VhYlACFEW2l50WTA2rTvIeH1iWHFVAF8ROaSXU2OJzJ3z6n7L5gvpr_mT8zzLgpe8NKyn6Ma4vEka4VJmzHXFSzQ_s2CfoE3gwbSs1lvwT-Aj0BPi_ydrAU/s400/file_extension.png" width="400" /></a><br />
<br />
But one thing I cannot figure out is auto-complete when typing "#include" for a custom file extension. It looks like Visual Studio only filters for a couple of extensions (e.g. .h, .inl, ...) and cannot recognize my new file type... If someone knows how to do it, please leave a comment below. Thank you.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4IRxiIglfAGTVvxeVStxUys4TdsTa8mIKRawqzXFITa9paprQJUSYuBtDCsTj5Aqt-oDIJqNSLZcB0jpNN71EUS0uQ3BcpGXBiUCnV4IolEggGX3OOBoHbTBCL0TPDNW5BpXhZ_YMNdct/s1600/missing_auto_complete.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4IRxiIglfAGTVvxeVStxUys4TdsTa8mIKRawqzXFITa9paprQJUSYuBtDCsTj5Aqt-oDIJqNSLZcB0jpNN71EUS0uQ3BcpGXBiUCnV4IolEggGX3OOBoHbTBCL0TPDNW5BpXhZ_YMNdct/s640/missing_auto_complete.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MSVC auto-complete filter for .h file only and cannot discover the new type .hds</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Serialization</span></b><br />
With the reflection data available, we know how large a struct is, how many variables it has and their byte offsets from the start of the struct, so we can serialize our C struct data. We define the serialization format with a data header and a number of data chunks as in the following figure:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7u2hNhTBv756f4TkbJoLd-Dgk_TlzgdWkUwXoMcl6cczti8Q_jvcNnyyKvozMd8pJ7lTmaPkd_lz7Lr9Hvo940zixQH9r6rY62YtbFjlOGDTdNhFQrK6c21Lxu_Zifq6qgswIIG0IECDs/s1600/format.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7u2hNhTBv756f4TkbJoLd-Dgk_TlzgdWkUwXoMcl6cczti8Q_jvcNnyyKvozMd8pJ7lTmaPkd_lz7Lr9Hvo940zixQH9r6rY62YtbFjlOGDTdNhFQrK6c21Lxu_Zifq6qgswIIG0IECDs/s320/format.png" width="148" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Memory layout of a serialized struct</td></tr>
</tbody></table>
<br />
<b>Data Header</b><br />
The data header contains all the TypeInfo used in the struct that gets serialized, as well as the architecture information (i.e. x86 or x64). During de-serialization, we can compare the runtime TypeInfo against the serialized TypeInfo to check whether the struct has any layout/type change (to speed up the comparison, we generate a hash value for every TypeInfo from the content of the file that defines the struct). If a layout/type change is detected, we de-serialize the struct variables one by one (and may perform data conversion if necessary, e.g. int to float); otherwise, we de-serialize the data in chunks.<br />
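As a rough illustration, the de-serialization entry point might look like the sketch below; all names (SerializedHeader, ByteReader, the two helper functions, the assumed mHash field) are hypothetical:<br />
<pre>
// Hypothetical fast/slow path decision during de-serialization.
bool Deserialize(void* dst, const TypeInfo&amp; runtimeInfo,
                 const SerializedHeader&amp; header, ByteReader&amp; reader)
{
    // Hash of the file content that defines the struct.
    if (header.typeInfoHash == runtimeInfo.mHash)
        return DeserializeChunks(dst, runtimeInfo, reader); // layout unchanged

    // Layout/type changed: match members one by one and convert types
    // (e.g. int -> float) where necessary.
    return DeserializeMemberByMember(dst, runtimeInfo, header, reader);
}
</pre>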
<br />
<b>Data Chunk</b><br />
The values of a C struct are stored in data chunks. There are 6 types of data chunks: RawBytes, size_t, String, Struct, PointerSimple and PointerComplex. There are 2 reasons to divide the chunks into different types: First, we want the serialized data to be usable across architectures (e.g. serialized on x86, de-serialized on x64), where some data types have different sizes depending on the architecture (e.g. size_t, pointers). Second, we want to support serializing pointers (with some restrictions). Below is a simple C struct that illustrates how the data is divided into chunks:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-uMQD_Mzbt6aqHUvOJjS7-dXPy9bCliPdJ0QNGEWDcq8XhCGdNAOTuUZ-eV_emPXeJBbT-tPMClBa-wU4KPJTXe6j8pYwlEj4ivIJjnzBiSifHQak-bQ9egOuMcRAO6brXsZSYvlT4IV7/s1600/sample.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="101" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-uMQD_Mzbt6aqHUvOJjS7-dXPy9bCliPdJ0QNGEWDcq8XhCGdNAOTuUZ-eV_emPXeJBbT-tPMClBa-wU4KPJTXe6j8pYwlEj4ivIJjnzBiSifHQak-bQ9egOuMcRAO6brXsZSYvlT4IV7/s320/sample.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This Sample struct get serialized into 3 data chunks</td></tr>
</tbody></table>
<br />
<b>RawBytes chunk</b><br />
RawBytes chunk is a chunk that contains a group of values whose sizes are architecture independent. Referring to the above Sample struct, the variables <i>val_int</i> and <i>val_float</i> are grouped into a single RawBytes chunk so that at run time, those values can be de-serialized by a single call to memcpy().<br />
<br />
<b>size_t chunk</b><br />
size_t chunk is a chunk that contains a single size_t value, which gets serialized as a 64 bit integer to avoid data loss (loading a value that is too large on the x86 architecture will cause a warning). Usually this type will not be used; I just added it in case I need to serialize this type for a third party library.<br />
<br />
<b>String chunk</b><br />
String chunk is used for storing the string value of a <i>char*</i>; the serializer can determine the length of the string (by looking for '\0') and serialize it appropriately.<br />
<br />
<b>Struct chunk</b><br />
Struct chunk is used when we serialize a struct that contains another struct which has some architecture dependent variables. With this chunk type, we can serialize/de-serialize recursively.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2aBAdFMv1m5_OjRNZONzheF6qPFW0Qm9Yxl7Xmn7ykP457I4mLnLnVJ19R2D0amSxxTMK0KH4WPmqBYpMChUX5LWjo2MzEIA_J8zUtoH_Whzpt2mezOyqZvIKe6KIqKPYMD5xRL0yX6Bg/s1600/chunk_struct.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2aBAdFMv1m5_OjRNZONzheF6qPFW0Qm9Yxl7Xmn7ykP457I4mLnLnVJ19R2D0amSxxTMK0KH4WPmqBYpMChUX5LWjo2MzEIA_J8zUtoH_Whzpt2mezOyqZvIKe6KIqKPYMD5xRL0yX6Bg/s320/chunk_struct.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The ComplexSample struct contains a Complex struct that has some architecture dependent values,<br />which cannot be collapsed into a RawBytes chunk, so it get serialized as a struct chunk instead.</td></tr>
</tbody></table>
<br />
<b>PointerSimple chunk</b><br />
PointerSimple chunk stores a pointer variable where the size of the data referenced by the pointer does not depend on the architecture, so it can be de-serialized by a single memcpy() similar to the RawBytes chunk. To determine the length of a pointer (sometimes a pointer is used like an array), my C struct header parser recognizes some special macros which define the length of the pointer (these macros expand to nothing when parsed by the normal Visual Studio C/C++ compiler). Usually the length of the pointer depends on another variable within the same struct, so with the special macro, we can define the length of the pointer like below:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoKc2GZA4qgQyT12o50Iz1cBHqRDk5UCkBlx3boc7j3vXuOpqex2ktKfd7_o4NV6Vggv78UIvpKu4jIZqdDcb33yArHVuFBK99koWwadOzwyJDYR8r6l6KDmscv47zxg6ZcQO0Rw0IyuWq/s1600/pointer_simple.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoKc2GZA4qgQyT12o50Iz1cBHqRDk5UCkBlx3boc7j3vXuOpqex2ktKfd7_o4NV6Vggv78UIvpKu4jIZqdDcb33yArHVuFBK99koWwadOzwyJDYR8r6l6KDmscv47zxg6ZcQO0Rw0IyuWq/s400/pointer_simple.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The DESC_ARRAY_SIZE() macro tells the serializer that <br />the size depends on the variable <i>num</i> within the same struct</td></tr>
</tbody></table>
<br />
When serializing the above struct, the serializer will look up the value of the variable <i>num</i> to determine the length of the pointer variable <i>data</i>, so that we know how many bytes need to be serialized for <i>data</i>.<br />
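A minimal sketch of how such a macro could be declared so that the normal compiler ignores it (the macro name matches the figure above; the exact definition is my assumption):<br />
<pre>
// Expands to nothing for the C/C++ compiler; only the header parser reads it.
#define DESC_ARRAY_SIZE(countVar)

struct Sample
{
    int num;
    DESC_ARRAY_SIZE(num) float* data; // 'data' points to 'num' floats
};
</pre>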
<br />
But using this macro is not enough to cover all my use cases; for example, when serializing <a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_subresource_data">D3D12_SUBRESOURCE_DATA</a> for a 3D texture, the length of the <i>pData</i> variable cannot simply be calculated from <i>RowPitch</i> and <i>SlicePitch</i>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXoowlPuz6PgsmMkcuIJlKF6NyzijMrFKWkOUnESk3ILWRIbEyT6zkrAUu0NU591n3OanbldRlosqMmpy4ySdvQqdyluf8JQLQOiQCT97mnoZkIp8P0NiJSfXZcT2cNUWz7v_Sq4_PvwJH/s1600/tex3D.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXoowlPuz6PgsmMkcuIJlKF6NyzijMrFKWkOUnESk3ILWRIbEyT6zkrAUu0NU591n3OanbldRlosqMmpy4ySdvQqdyluf8JQLQOiQCT97mnoZkIp8P0NiJSfXZcT2cNUWz7v_Sq4_PvwJH/s400/tex3D.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A sample struct to serialize a 3D texture, which the length of <br />D3D12_SUBRESOURCE_DATA::pData depends on the depth of the resources</td></tr>
</tbody></table>
<br />
The length can only be determined with access to the struct Texture3DDesc, which has the depth information. To tackle this, my serializer can register custom pointer length calculation callbacks (e.g. registered for the D3D12_SUBRESOURCE_DATA::pData variable inside the Texture3DDesc struct). The serializer keeps track of a stack of struct types that are currently being serialized, so that the callback can be triggered appropriately.<br />
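A hedged sketch of what such a registration could look like; the SerializeContext API, the registration method and the Texture3DDesc field names are assumptions for illustration only:<br />
<pre>
// The callback is found by walking the stack of structs currently being
// serialized, so pData can be sized using the outer Texture3DDesc.
serializer.RegisterPointerLengthCallback(
    "Texture3DDesc", "D3D12_SUBRESOURCE_DATA::pData",
    [](const SerializeContext&amp; ctx) -&gt; size_t
    {
        const Texture3DDesc&amp; desc = ctx.OuterStruct&lt;Texture3DDesc&gt;();
        const D3D12_SUBRESOURCE_DATA&amp; sub =
            ctx.CurrentStruct&lt;D3D12_SUBRESOURCE_DATA&gt;();
        return sub.SlicePitch * desc.depth; // bytes referenced by pData
    });
</pre>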
<br />
Finally, if a pointer variable has neither a length macro nor a registered length calculation callback, we assume the pointer has a length of 1 (or 0 if nullptr).<br />
<br />
<b>PointerComplex chunk</b><br />
PointerComplex chunk stores a pointer variable where the referenced data is architecture dependent, similar to the Struct chunk type. It uses the same pointer length calculation method as the PointerSimple chunk type.<br />
<br />
<b>Serialize union</b><br />
We can also serialize structs with union values that depend on another integer/enum variable, similar to <a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_shader_resource_view_desc">D3D12_SHADER_RESOURCE_VIEW_DESC</a>. We utilize the same macro approach used for pointer length calculation. For example:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZA3I5DTuFV6aamVEtVq-vhtHOutcYGb7f0OPa1JTMABaItH2CpZPLRZGzPhDtzzO40xPSLcUFC40HTkcdrCBSKZGx9xRFzVem2GqmKKc5XWTrU-R4ErqcBg7nJD3nzMKdm42rrlYaJdgM/s1600/union.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZA3I5DTuFV6aamVEtVq-vhtHOutcYGb7f0OPa1JTMABaItH2CpZPLRZGzPhDtzzO40xPSLcUFC40HTkcdrCBSKZGx9xRFzVem2GqmKKc5XWTrU-R4ErqcBg7nJD3nzMKdm42rrlYaJdgM/s400/union.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A sample to serialize variables inside union</td></tr>
</tbody></table>
In the above example, the DESC_UNION() macro adds information about when the variable needs to be serialized. During serialization, we check the value of the variable <i>type</i>: if <i>type == ValType::Double</i>, we serialize <i>val_double</i>; else if <i>type == ValType::Integer</i>, we serialize <i>val_integer</i>.<br />
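A sketch of the idea in text form, with the DESC_UNION() definition being my assumption (like DESC_ARRAY_SIZE(), it expands to nothing for the compiler and is only read by the header parser):<br />
<pre>
#define DESC_UNION(condition)

enum class ValType { Double, Integer };

struct UnionSample
{
    ValType type;
    union
    {
        DESC_UNION(type == ValType::Double)  double val_double;
        DESC_UNION(type == ValType::Integer) int    val_integer;
    };
};
</pre>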
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
This post has described how a simple reflection system for C structs is implemented, using a macro based approach assisted with a code generator. Based on the reflection data, we can implement a serialization system to save/load C structs using compile time type information. This system is simple, but it does not support complicated features like C++ class inheritance. It is mainly for serializing C style structs, which is enough for my current needs.<br />
<b><br /></b>
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/">https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description">https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://blog.molecular-matters.com/2015/12/11/getting-the-type-of-a-template-argument-as-string-without-rtti/">https://blog.molecular-matters.com/2015/12/11/getting-the-type-of-a-template-argument-as-string-without-rtti/</a></span><br />
<br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-27379548335882045482019-07-07T20:53:00.001+08:002019-07-07T20:53:26.338+08:00Render Graph<b><span style="font-size: large;">Introduction</span></b><br />
Render graph is a directed acyclic graph that can be used to specify the dependencies between render passes. It is a convenient way to manage rendering, especially when using a low level API such as D3D12. There are many great resources that talk about it, such as <a href="https://www.gdcvault.com/play/1024612/FrameGraph-Extensible-Rendering-Architecture-in">this</a> and <a href="https://ourmachinery.com/post/high-level-rendering-using-render-graphs/">this</a>. In this post I will talk about how the render graph is set up, render pass reordering, as well as resource barrier management.<br />
<br />
<b><span style="font-size: large;">Render Graph set up</span></b><br />
For a simplified view of a render graph, we can treat each node inside the graph as a single render pass. For example, we can have a graph for a simple deferred renderer like this:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGUyLG6mojDXBTqX00RyZbb6rMQ5BMOCyd45MQIymZaPqTJ6d-4fsTNfP70JKssmhSD-8dT2gcFZmWV-TnCOvsKs31U7As65Eyl3qVA1drKIdYgP4L3-9W6zYzPVTPd4yzo8of1K9nwmhQ/s1600/defer_render_pass.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGUyLG6mojDXBTqX00RyZbb6rMQ5BMOCyd45MQIymZaPqTJ6d-4fsTNfP70JKssmhSD-8dT2gcFZmWV-TnCOvsKs31U7As65Eyl3qVA1drKIdYgP4L3-9W6zYzPVTPd4yzo8of1K9nwmhQ/s320/defer_render_pass.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Render passes dependency within a render graph</td></tr>
</tbody></table>
By having such a graph, we can derive the dependencies of the render passes, remove unused render passes, as well as reorder them. In my toy graphics engine, I use a simple scheme to reorder render passes. Taking the below render graph as an example, the render passes are added in the following order:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBxKsZVlHwslvLLQrnU1rnZMG9R663PY_-adfNaUGk7r2ZdqjgzWADhHR5cyLbdKvpOJ3zYgN2R6772IxQQtGe1WeRvrUFVM5lMXTXC1amyCsWSjT336Kwy_xesMUZVClPmI8RMksTeh4y/s1600/graph_eg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBxKsZVlHwslvLLQrnU1rnZMG9R663PY_-adfNaUGk7r2ZdqjgzWADhHR5cyLbdKvpOJ3zYgN2R6772IxQQtGe1WeRvrUFVM5lMXTXC1amyCsWSjT336Kwy_xesMUZVClPmI8RMksTeh4y/s320/graph_eg.png" width="205" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A render graph example</td></tr>
</tbody></table>
We can group it into several dependency levels like this:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnlrbjjaZw0tpztioI3qqR-fwqMTD_r6xBX9lFxL-HqFQYO_ow1h8ONq4ERE6E_QNTtjkEBNUc9H_PAj1fIlVrsFfANRc2JVa9KScFbznSsMKovZEvZ0zc4hIPUpVclzTHL66Bm3Yyqr8B/s1600/graph_eg_lv.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnlrbjjaZw0tpztioI3qqR-fwqMTD_r6xBX9lFxL-HqFQYO_ow1h8ONq4ERE6E_QNTtjkEBNUc9H_PAj1fIlVrsFfANRc2JVa9KScFbznSsMKovZEvZ0zc4hIPUpVclzTHL66Bm3Yyqr8B/s400/graph_eg_lv.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">split render passes into several dependency levels</td></tr>
</tbody></table>
Within each level, the passes are independent and can be reordered freely, so the render passes are enqueued into the command list in the following order:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWakFsL_J1LuLFbca69lPa8nPp-1TniXGBuxrsfM7szP3ng6KPSiopo6UskYQsbi76FF85Ljrezsf2_f8LldMPbBzNMT0CtBr8RtXwObV77KoqAiPtQqLktfRvmmVpo6csbAc-mh17WTMw/s1600/graph_eg_order.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWakFsL_J1LuLFbca69lPa8nPp-1TniXGBuxrsfM7szP3ng6KPSiopo6UskYQsbi76FF85Ljrezsf2_f8LldMPbBzNMT0CtBr8RtXwObV77KoqAiPtQqLktfRvmmVpo6csbAc-mh17WTMw/s640/graph_eg_order.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Reordered render passes</td></tr>
</tbody></table>
Between each dependency level, we batch resource barriers to transition the resources to the correct states.<br />
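A minimal sketch of the level assignment, with hypothetical types: a pass's dependency level is one more than the deepest pass it depends on, computed by visiting the passes in topological order:<br />
<pre>
#include &lt;algorithm&gt;
#include &lt;vector&gt;

struct RenderPass
{
    std::vector&lt;const RenderPass*&gt; dependencies;
    int level = 0;
};

// 'topologicallySorted' must list every pass after all of its dependencies.
void ComputeDependencyLevels(std::vector&lt;RenderPass*&gt;&amp; topologicallySorted)
{
    for (RenderPass* pass : topologicallySorted)
        for (const RenderPass* dep : pass-&gt;dependencies)
            pass-&gt;level = std::max(pass-&gt;level, dep-&gt;level + 1);
}
</pre>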
<br />
<b><span style="font-size: large;">Transient Resources</span></b><br />
The above is just a simplified view of the graph. In fact, each render pass consists of a number of inputs and outputs. Every input/output is a graphics resource (e.g. texture), and render passes are connected through such resources within a render graph.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHkYlA6sh-bjwvtCNNuRu9q1E_NIJInKZoUpx_MyV65WSEzPloJY8LKMIagbEsWxL6IyD9A1-ofguSR_ztLuBao3FB3d7AiLAL5pZkbfuE0iBu23iqxRT5OBA2vaKGLUHOfq0MNR9IQvAs/s1600/defer_render_resource.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHkYlA6sh-bjwvtCNNuRu9q1E_NIJInKZoUpx_MyV65WSEzPloJY8LKMIagbEsWxL6IyD9A1-ofguSR_ztLuBao3FB3d7AiLAL5pZkbfuE0iBu23iqxRT5OBA2vaKGLUHOfq0MNR9IQvAs/s640/defer_render_resource.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Render graph connecting render passes and resources</td></tr>
</tbody></table>
As you can see in the above example, there are many transient resources (e.g. depth buffer, shadow map, etc.). We handle such transient resources by using a texture pool: a texture is reused after it is no longer needed by previous passes (<a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-createplacedresource">placed resources</a> are not used for simplicity). When building a render graph, we compute the lifetime of every transient resource (i.e. the dependency levels where the resource starts/ends being used). So we can free a transient resource when execution goes beyond its last dependency level and reuse it for later render passes. To specify a render pass input/output in my engine, I only need to specify its size/format and don't need to worry about resource creation; the transient resource pool will create the textures as well as the required resource views (e.g. SRV/DSV/RTV).<br />
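A small sketch of the lifetime bookkeeping under these assumptions (the engine's actual types may differ): each transient resource records the first/last dependency level that touches it, and once execution passes the last level the backing texture returns to the pool.<br />
<pre>
#include &lt;algorithm&gt;
#include &lt;climits&gt;

// Hypothetical per-resource lifetime, measured in dependency levels.
struct TransientLifetime
{
    int firstLevel = INT_MAX;
    int lastLevel  = -1;
};

// Called for every (pass, resource) usage while building the graph.
void AddUsage(TransientLifetime&amp; life, int passDependencyLevel)
{
    life.firstLevel = std::min(life.firstLevel, passDependencyLevel);
    life.lastLevel  = std::max(life.lastLevel,  passDependencyLevel);
}
</pre>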
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have described how render passes are reordered inside the render graph, when barriers are inserted, and how transient resources are handled. But I have not implemented parallel recording of command lists and async compute yet. It really takes much more effort to use D3D12 than D3D11. I think the current state of my hobby graphics engine is good enough to use. Looks like I can start learning DXR after spending lots of effort on basic D3D12 set up code. =]<br />
<div>
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://www.gdcvault.com/play/1024612/FrameGraph-Extensible-Rendering-Architecture-in">https://www.gdcvault.com/play/1024612/FrameGraph-Extensible-Rendering-Architecture-in</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://ourmachinery.com/post/high-level-rendering-using-render-graphs/">https://ourmachinery.com/post/high-level-rendering-using-render-graphs/</a></span><br />
<br />
<br />
<br /></div>
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-53743086507577322942019-07-02T23:59:00.000+08:002019-07-02T23:59:01.051+08:00D3D12 Constant Buffer Management<b><span style="font-size: large;">Introduction</span></b><br />
D3D12 does not have an explicit constant buffer API object (unlike <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d11/overviews-direct3d-11-resources-buffers-intro#constant-buffer">D3D11</a>). All we have in D3D12 is <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/nf-d3d12-id3d12device-createconstantbufferview">ID3D12Resource</a>, which needs to be sub-divided into smaller regions with <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ns-d3d12-d3d12_constant_buffer_view_desc">Constant Buffer Views</a>. And it is our job to handle the constant buffer lifetime and avoid updating constant buffer values while the GPU is still using them. This post will describe how I handle this topic.<br />
<br />
<b><span style="font-size: large;">Constant buffer pool</span></b><br />
We allocate a large ID3D12Resource and treat it as an object pool by sub-dividing it into many small constant buffers (let's call it a constant buffer pool). Since constant buffers are required to be 256 bytes aligned (I can only find this requirement in the <a href="https://docs.microsoft.com/en-us/previous-versions//dn899216(v=vs.85)">previous documentation</a>, while the updated documentation only mentions such a requirement in <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/upload-and-readback-of-texture-data">Uploading Texture Data Through Buffers</a>, which is under a section about textures...), I defined 3 fixed size pools: 256/512/1024 bytes. These 3 sizes are enough for my needs as most constant buffers are small (in <a href="https://simonstechblog.blogspot.com/2017/11/seal-guardian-announced.html">Seal Guardian</a>, the largest constant buffer size is 560 bytes, while large data like the skinning matrix palette is uploaded via texture).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkrqA-u1pkd60zpsG7pEYVPjLg7GKi_P2ZV5pgVX7oaQy9kLdoEoUUUG9CYa6EKJPFImCszpJ-ZsB2WHVgYlAHpqMN5tVKOx564WK97TAPevRFkrwJe6OH2gDfMieD1SGAaFFs3V0o0DN/s1600/cb_pool.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkrqA-u1pkd60zpsG7pEYVPjLg7GKi_P2ZV5pgVX7oaQy9kLdoEoUUUG9CYa6EKJPFImCszpJ-ZsB2WHVgYlAHpqMN5tVKOx564WK97TAPevRFkrwJe6OH2gDfMieD1SGAaFFs3V0o0DN/s400/cb_pool.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">3 constant buffer pools with different size</td></tr>
</tbody></table>
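A tiny sketch of the pool selection under this scheme (the helper names are mine, not the engine's): round the requested size up to the 256-byte alignment requirement, then pick the smallest pool that fits.<br />
<pre>
size_t AlignTo256(size_t size) { return (size + 255) &amp; ~size_t(255); }

int SelectPoolIndex(size_t constantBufferSize)
{
    const size_t aligned = AlignTo256(constantBufferSize);
    if (aligned &lt;= 256)  return 0; // 256 bytes pool
    if (aligned &lt;= 512)  return 1; // 512 bytes pool
    if (aligned &lt;= 1024) return 2; // 1024 bytes pool
    return -1; // larger constant buffers are not needed in my engine
}
</pre>
<br />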
In the <a href="https://simonstechblog.blogspot.com/2019/06/d3d12-descriptor-heap-management.html">last post</a>, a non shader visible descriptor heap manager is used to handle non shader visible descriptors. But in fact, that is only used for SRV/DSV/RTV descriptors; constant buffer views are managed with another scheme. As described above, when we create an ID3D12Resource for a constant buffer pool, we also create a non shader visible ID3D12DescriptorHeap with a size large enough to hold descriptors pointing to all the constant buffers inside the pool.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9AdziNXpXq0v17t3JkwQvpJVxWhkWPBwYCWTWRwfemy3kWBqo-vF80j7BNAAs4nZpay9I4mTUFd9P68VQbP1mC8IfepiINbHMHOvxXZhFOYrd0iwfjeFoJ9QKkz3l0gjITy4cflApSwyZ/s1600/res_pair.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9AdziNXpXq0v17t3JkwQvpJVxWhkWPBwYCWTWRwfemy3kWBqo-vF80j7BNAAs4nZpay9I4mTUFd9P68VQbP1mC8IfepiINbHMHOvxXZhFOYrd0iwfjeFoJ9QKkz3l0gjITy4cflApSwyZ/s320/res_pair.png" width="280" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ID3D12Resource and ID3D12DescriptorHeap are created in pair</td></tr>
</tbody></table>
We also split the constant buffer pools based on their usage: static/dynamic. So there are 6 constant buffer pools in total inside my toy engine (static 256/512/1024 bytes pools + dynamic 256/512/1024 bytes pools).<br />
<br />
<b><span style="font-size: large;">Dynamic constant buffer</span></b><br />
Constant buffers can be updated dynamically. Each constant buffer contains a CPU side copy of its constant values. When it is bound before a draw call, those values will be copied to the dynamic constant buffer pool (created in an <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_heap_type">upload heap</a>). A piece of memory for the constant buffer values is allocated from the pool in a ring buffer fashion. If the pool is full (i.e. the ring buffer wraps around too fast and all the constant buffers are still in use by the GPU), we create a larger pool, and the existing pool is deleted after all related GPU commands finish execution.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij7BdRLDDXb6u_xS1WPuSqk40C2NWLVtY6AlFpY6SPIU8PyIcbYMk2LLpa3IW8_Z_6JIn40PBOzKxAxEAwfX71WcluMxPOe7Hb2p_jWRX2Vm1Hh-yPiOrg2X3Ta6KUqhn9fypxGzWH_3dx/s1600/dyna_pool_resize.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij7BdRLDDXb6u_xS1WPuSqk40C2NWLVtY6AlFpY6SPIU8PyIcbYMk2LLpa3IW8_Z_6JIn40PBOzKxAxEAwfX71WcluMxPOe7Hb2p_jWRX2Vm1Hh-yPiOrg2X3Ta6KUqhn9fypxGzWH_3dx/s400/dyna_pool_resize.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Resizing dynamic constant buffer pool, the previous pool<br />
will be deleted after executing related GPU commands</td></tr>
</tbody></table>
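A minimal sketch of the ring buffer suballocation, assuming a hypothetical fence check to detect the full case (offsets stay 256-byte aligned because the pool only hands out 256/512/1024 byte blocks):<br />
<pre>
struct RingState { size_t head = 0; size_t poolSize = 0; };

// Assumed fence query: does the GPU still read this range of the pool?
extern bool RangeStillInUseByGpu(size_t offset, size_t size);

bool RingAllocate(RingState&amp; ring, size_t size, size_t&amp; outOffset)
{
    size_t head = ring.head;
    if (head + size &gt; ring.poolSize)
        head = 0; // wrap around to the start of the pool
    if (RangeStillInUseByGpu(head, size))
        return false; // pool is full: create a larger pool instead
    outOffset = head;
    ring.head = head + size;
    return true;
}
</pre>
<br />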
To avoid copying the same constant buffer values to the pool when a constant buffer is bound multiple times, we keep 2 integer values for every dynamic constant buffer: a "last upload frame index" and a "value version". The last upload frame index is the frame index at which the CPU constant buffer values were copied to the dynamic pool. The value version is an integer which is monotonically increased every time the constant buffer values get modified. By checking these 2 integers, we can avoid duplicated copies of a constant buffer in the dynamic pool and re-use the previously copied values.<br />
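A sketch of this check; the member and helper names below are hypothetical:<br />
<pre>
#include &lt;cstdint&gt;

struct DynamicCB
{
    uint64_t lastUploadFrame = ~0ull; // frame the values were last copied
    uint32_t uploadedVersion = 0;     // value version that was copied
    uint32_t valueVersion    = 1;     // bumped on every value modification
    size_t   offsetInPool    = 0;     // where the last copy lives
};

extern size_t CopyValuesToDynamicPool(const DynamicCB&amp; cb); // assumed helper

size_t BindDynamicCB(DynamicCB&amp; cb, uint64_t frameIndex)
{
    if (cb.lastUploadFrame != frameIndex || cb.uploadedVersion != cb.valueVersion)
    {
        cb.offsetInPool    = CopyValuesToDynamicPool(cb);
        cb.lastUploadFrame = frameIndex;
        cb.uploadedVersion = cb.valueVersion;
    }
    return cb.offsetInPool; // reuse the copy made earlier this frame
}
</pre>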
<br />
<b><span style="font-size: large;">Static constant buffer</span></b><br />
A static constant buffer will have a static descriptor handle, as described in the <a href="https://simonstechblog.blogspot.com/2019/06/d3d12-descriptor-heap-management.html">last post</a>. The static constant buffer pools are created in the <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_heap_type">default heap</a>. The pool is managed in a free-list fashion, as opposed to the ring buffer of the dynamic pool. Also, when the pool is full, we still create an extra pool for new constant buffer allocation requests. But different from the dynamic pools, previous pools will not be deleted when new pools get created.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisFO8krJU0wCEvUwIOH7C8MoaK9BFCUlimG_iAGhPgAbfojJvB0GAyCgBIWjdVcp2EXCbpH1CsrYYVQLCT97JKIyyTeICwjpJz6nnn80p-_pm-priXWcMJdI3Tr5aON5uTrqlc4OqzKVnX/s1600/stat_poo_resizepng.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisFO8krJU0wCEvUwIOH7C8MoaK9BFCUlimG_iAGhPgAbfojJvB0GAyCgBIWjdVcp2EXCbpH1CsrYYVQLCT97JKIyyTeICwjpJz6nnn80p-_pm-priXWcMJdI3Tr5aON5uTrqlc4OqzKVnX/s640/stat_poo_resizepng.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Creating more static constant buffer pool if existing pools are full</td></tr>
</tbody></table>
To upload static constant buffer values to the GPU (since static pools are created in the default heap), we use the dynamic constant buffer pool instead of creating another upload heap. Every frame, we gather all newly created static constant buffers; then, before we start rendering the frame, we copy all their CPU constant buffer values to the dynamic constant buffer pool and schedule <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/nf-d3d12-id3d12graphicscommandlist-copybufferregion">ID3D12GraphicsCommandList::CopyBufferRegion()</a> calls to copy those values from the upload heap to the default heap. By grouping all the static constant buffer uploads, we can reduce the number of <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ns-d3d12-d3d12_resource_barrier">D3D12_RESOURCE_BARRIER</a>s needed to transition between the <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_resource_states">D3D12_RESOURCE_STATE_COPY_DEST and D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER</a> states.<br />
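A sketch of the batched upload; the D3D12 calls are real, while staticPool, dynamicPool and PendingUpload are assumed names for illustration:<br />
<pre>
#include &lt;algorithm&gt;
#include &lt;d3d12.h&gt;
#include &lt;vector&gt;

struct PendingUpload { UINT64 dstOffset, srcOffset, numBytes; }; // assumed

void UploadStaticConstantBuffers(ID3D12GraphicsCommandList* cmdList,
                                 ID3D12Resource* staticPool,  // default heap
                                 ID3D12Resource* dynamicPool, // upload heap
                                 const std::vector&lt;PendingUpload&gt;&amp; pendingUploads)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = staticPool;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_COPY_DEST;
    cmdList-&gt;ResourceBarrier(1, &amp;barrier); // one barrier batch per frame

    for (const PendingUpload&amp; u : pendingUploads) // gathered this frame
        cmdList-&gt;CopyBufferRegion(staticPool, u.dstOffset,
                                  dynamicPool, u.srcOffset, u.numBytes);

    std::swap(barrier.Transition.StateBefore, barrier.Transition.StateAfter);
    cmdList-&gt;ResourceBarrier(1, &amp;barrier); // back to the constant buffer state
}
</pre>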
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have described how constant buffers are managed in my toy engine. It uses a number of pools of different sizes, managed in a ring buffer fashion for dynamic constant buffers and in a free-list fashion for static constant buffers. Uploads of static constant buffer contents are grouped together to reduce barrier usage. However, I currently only split the usage into static/dynamic. In the future, I would like to investigate the performance of adding another usage type for constant buffers that are updated every frame but used frequently in many draw calls (e.g. write once, read many within a frame), placing those resources in the default heap instead of the current dynamic upload heap.<br />
<br />
<b>Reference</b><br />
<span style="font-size: x-small;">[1] <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/large-buffers">https://docs.microsoft.com/en-us/windows/desktop/direct3d12/large-buffers</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://www.gamedev.net/forums/topic/679285-d3d12-how-to-correctly-update-constant-buffers-in-different-scenarios/">https://www.gamedev.net/forums/topic/679285-d3d12-how-to-correctly-update-constant-buffers-in-different-scenarios/</a></span><br />
<br />
<br />
<br />
<br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-78874721328099291272019-06-29T16:03:00.001+08:002019-06-29T16:03:33.853+08:00D3D12 Descriptor Heap Management<b><span style="font-size: large;">Introduction</span></b><br />
Continuing from the <a href="https://simonstechblog.blogspot.com/2019/06/d3d12-root-signature-management.html">last post</a>, we described how root signatures are managed to bind resources. But the root signature is just one part of resource binding; we also need descriptors to <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/resource-binding">bind resources</a>. <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/descriptors-overview">Descriptors</a> are small blocks of memory describing an object (CBV/SRV/UAV/Sampler) to the GPU. They are stored in <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/descriptor-heaps">descriptor heaps</a>, and they may be <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/shader-visible-descriptor-heaps">shader visible</a> or <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/non-shader-visible-descriptor-heaps">non shader visible</a>. In this post, I will talk about how descriptors are managed for resource binding in my toy graphics engine.<br />
<br />
<b><span style="font-size: large;">Non shader visible heap</span></b><br />
Let's start with the non shader visible heap management. We can treat a descriptor as a pointer to a GPU resource (e.g. texture). A descriptor heap is a piece of memory used for storing descriptors, and the size of a single descriptor can be queried by <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/nf-d3d12-id3d12device-getdescriptorhandleincrementsize">ID3D12Device::GetDescriptorHandleIncrementSize()</a>. So we treat a descriptor heap as an object pool, and every descriptor within the same heap can be referenced by an index.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO3Iy5uzY8CINXGGlNOmcnOgdjFbVVv0DSVZfZ6HFo3nKGdXzamxyv4esCkRvxiPn1DG1AYLWQIlPZXnKYbW4ZWi-PlzT75kzJQgZK_FdktvG1q6VfxjJL_IYkIe_PGyadajT1-R4lLsju/s1600/non_shader_vis_heap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO3Iy5uzY8CINXGGlNOmcnOgdjFbVVv0DSVZfZ6HFo3nKGdXzamxyv4esCkRvxiPn1DG1AYLWQIlPZXnKYbW4ZWi-PlzT75kzJQgZK_FdktvG1q6VfxjJL_IYkIe_PGyadajT1-R4lLsju/s320/non_shader_vis_heap.png" width="198" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Non shader visible descriptor heap containing N descriptors</td></tr>
</tbody></table>
Since we don't know how many descriptors are needed in advance and we may have many non shader visible heaps, a non shader visible heap manager is created for allocating descriptors from the descriptor heap(s). This manager contains at least 1 descriptor heap. When a descriptor allocation request is made to the manager, it will first look for a free descriptor in the existing descriptor heap(s); if none is found, a new descriptor heap will be created to handle the request.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnJAUZEeZA_aNyMZR1nsbbWrFEAua43z5G0ErucrFwAAvqkkZoSZ-vvEWYIngiGzF8uf-Tzk-082D8FRoT1SWQTDT3ovJzyIbN_kSyedgF7g1f2W5xsjRmtK4jEn3imUVVI2xZh2PGS4lR/s1600/heap_mgr.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnJAUZEeZA_aNyMZR1nsbbWrFEAua43z5G0ErucrFwAAvqkkZoSZ-vvEWYIngiGzF8uf-Tzk-082D8FRoT1SWQTDT3ovJzyIbN_kSyedgF7g1f2W5xsjRmtK4jEn3imUVVI2xZh2PGS4lR/s400/heap_mgr.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Descriptor heap manager handles descriptor allocation request, create descriptor heap if necessary</td></tr>
</tbody></table>
So within the graphics engine, we use a "non shader visible descriptor handle" to reference a D3D12 descriptor; it stores the heap index and descriptor index with respect to a descriptor heap manager. All the textures created in the engine will have a "non shader visible descriptor handle" for resource binding (more on this later).<br />
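A sketch of such a handle and how it could resolve to a D3D12 CPU descriptor (the handle layout is my assumption; the D3D12 calls are real):<br />
<pre>
#include &lt;cstdint&gt;
#include &lt;d3d12.h&gt;

// Hypothetical handle: which heap in the manager, and which slot inside it.
struct NonShaderVisibleDescriptorHandle
{
    uint16_t heapIndex;
    uint16_t descriptorIndex;
};

// 'incrementSize' is queried once via
// ID3D12Device::GetDescriptorHandleIncrementSize().
D3D12_CPU_DESCRIPTOR_HANDLE ResolveHandle(
    const NonShaderVisibleDescriptorHandle&amp; h,
    ID3D12DescriptorHeap* const* heaps, UINT incrementSize)
{
    D3D12_CPU_DESCRIPTOR_HANDLE cpu =
        heaps[h.heapIndex]-&gt;GetCPUDescriptorHandleForHeapStart();
    cpu.ptr += SIZE_T(h.descriptorIndex) * incrementSize;
    return cpu;
}
</pre>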
<br />
<b><span style="font-size: large;">Shader visible heap</span></b><br />
Next, we will talk about the shader visible heap management. The shader visible heap is responsible for binding resources that get used in shaders. It is recommended that only 1 heap is used for all frames so that asynchronous compute and graphics workloads can run in parallel (<a href="https://developer.nvidia.com/dx12-dos-and-donts">on NVidia hardware</a>). So we just create 1 large shader visible heap at the start of the program and don't bother to resize/allocate a larger heap when the heap is full (we just assert in this case). This single large shader visible descriptor heap is divided into 2 regions: static / dynamic.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipyy9rOPzMLxs65SrWMRLFHnrVKEb_ei4m1J1LajMZRFScEmhyphenhyphenYEhnvVKA8oznk3DtZOfDpmgTDafsrS9LxVag7Mt6AhGbzV7eAa4O82_9UGFhlk4Z2xJARKscddOrngZKC3KYggBN5KbJ/s1600/shader_vsi_heap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipyy9rOPzMLxs65SrWMRLFHnrVKEb_ei4m1J1LajMZRFScEmhyphenhyphenYEhnvVKA8oznk3DtZOfDpmgTDafsrS9LxVag7Mt6AhGbzV7eAa4O82_9UGFhlk4Z2xJARKscddOrngZKC3KYggBN5KbJ/s320/shader_vsi_heap.png" width="198" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A single large shader visible descriptor heap, divided into 2 regions</td></tr>
</tbody></table>
<br />
<b>Dynamic descriptor</b><br />
Dynamic descriptors are used for transient resources whose descriptor tables cannot be reused often. During resource binding (e.g. texture), their non shader visible descriptors will be copied to the shader visible heap via <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/nf-d3d12-id3d12device-copydescriptors">ID3D12Device::CopyDescriptors()</a>, where the copy destination (i.e. dynamic shader visible descriptors) is allocated in a ring buffer fashion. (Note the copy operation has a restriction that the <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/copying-descriptors">copy source</a> must be in a non shader visible heap; that's why we allocate a "non shader visible descriptor handle" for every texture.)<br />
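A sketch of this path; the D3D12 calls are real, while the ring allocation helper and variable names are assumptions:<br />
<pre>
#include &lt;d3d12.h&gt;
#include &lt;vector&gt;

// Assumed ring buffer allocator: returns the first free slot in the dynamic
// region of the shader visible heap.
extern UINT AllocateDynamicDescriptors(UINT count);

void BindDynamicTable(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                      ID3D12DescriptorHeap* shaderVisibleHeap, UINT incrementSize,
                      UINT rootParameterIndex,
                      const std::vector&lt;D3D12_CPU_DESCRIPTOR_HANDLE&gt;&amp; srcHandles)
{
    UINT numDescriptors = (UINT)srcHandles.size();
    UINT start = AllocateDynamicDescriptors(numDescriptors);

    D3D12_CPU_DESCRIPTOR_HANDLE dst =
        shaderVisibleHeap-&gt;GetCPUDescriptorHandleForHeapStart();
    dst.ptr += SIZE_T(start) * incrementSize;

    // srcHandles are non shader visible descriptors of the bound resources.
    // Null source range sizes means every source range has size 1.
    device-&gt;CopyDescriptors(1, &amp;dst, &amp;numDescriptors,
                            numDescriptors, srcHandles.data(), nullptr,
                            D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

    D3D12_GPU_DESCRIPTOR_HANDLE table =
        shaderVisibleHeap-&gt;GetGPUDescriptorHandleForHeapStart();
    table.ptr += UINT64(start) * incrementSize;
    cmdList-&gt;SetGraphicsRootDescriptorTable(rootParameterIndex, table);
}
</pre>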
<br />
<b>Static descriptor</b><br />
Static descriptors are used for resources which can be grouped together into a descriptor table so that it can be reused over multiple frames. For example, the set of textures inside a material will not change very often, so those textures can be grouped into a descriptor table. My current approach manages the static region of the shader visible heap in a "stack" based fashion: instead of a stack of individual descriptors, we have a stack of groups of descriptors, and often 1 static descriptor group will be created during level load.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc3c4WdIO5qt2X8CtIRU4iRhQxIJ3BtHPnbC2vyZYKPGte94EJsI4p3WDP1IqX_8InhLmOaVHsjeaEiwdRk_wy8UCdobVxoGokvYEGgw62pyS514U_78I_VlDAq7-MpjYT1ttPtH6lzT37/s1600/static_heap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc3c4WdIO5qt2X8CtIRU4iRhQxIJ3BtHPnbC2vyZYKPGte94EJsI4p3WDP1IqX_8InhLmOaVHsjeaEiwdRk_wy8UCdobVxoGokvYEGgw62pyS514U_78I_VlDAq7-MpjYT1ttPtH6lzT37/s320/static_heap.png" width="148" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">static descriptors are packed into group during level load</td></tr>
</tbody></table>
Inside a group of static descriptors, the descriptors are sorted such that all constant buffer descriptors appear before texture descriptors. Also, null descriptors may need to be added to respect the <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/hardware-support">Hardware Tiers restrictions</a>. To identify a static descriptor in the shader visible heap, we use the stack group index together with the descriptor index within the group.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG7Mpx4qcKUY7LzmKDS1DgmT9CsIemdVNXlL4C4k3O5aKJxSi2HQGN4OJidHyHZtXUwF90L4o3au1vWO64mJ4RpxgHhW02hFn-nmTzasq5Eyn2fDu7EVmyWI8ha5V71mpYB7ihBkxNPG7V/s1600/static_gp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG7Mpx4qcKUY7LzmKDS1DgmT9CsIemdVNXlL4C4k3O5aKJxSi2HQGN4OJidHyHZtXUwF90L4o3au1vWO64mJ4RpxgHhW02hFn-nmTzasq5Eyn2fDu7EVmyWI8ha5V71mpYB7ihBkxNPG7V/s320/static_gp.png" width="166" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">descriptor are ordered by type, with necessary padding</td></tr>
</tbody></table>
<div>
Each "static resource"(e.g. constant buffer/texture) will have a "static descriptor handle" beside the "non shader visible descriptor handle". We can check whether those resources are within the same descriptor table by comparing the stack group index and descriptor index to see whether they are in consecutive order. With such information, we can create a resource binding API similar to D3D11 (e.g. <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-pssetshaderresources">ID3D11DeviceContext::PSSetShaderResources()</a> ), if all the resources in the API call are in the same descriptor table, we can use the static descriptor to bind the descriptor table directly, otherwise, we switch to use the dynamic descriptor approach described in previous section to create a continuous descriptor table. (I have also think of instead of using similar binding API as D3D11, may be I can create a so call "descriptor table" object explicitly, say during material loading and grouping material textures into a descriptor table, so that resources binding can skip the consecutive descriptor index check described above. But currently I just stick with a simple solution first...)<br />
<br />
As mentioned before, static descriptor groups are allocated in a "stack" based approach. But my current implementation is not strictly "last in - first out": we can remove a group in between, which leaves a "hole" in the static shader visible heap region and results in fragmentation.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAvorK-tLzkZQ3gQRt6A-UrKCkXAsjfKnBVTvQlsNPfJN13vbuOnCg0eZ9Ivlmv3CvEZxBfrYt-nq4rrcX07ivUbaSSaM2zmn2cbLOeCoomI2aEKkKlkUDPGzBXG-ESjxiZ1c8vPDY9Q6U/s1600/fragmentation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAvorK-tLzkZQ3gQRt6A-UrKCkXAsjfKnBVTvQlsNPfJN13vbuOnCg0eZ9Ivlmv3CvEZxBfrYt-nq4rrcX07ivUbaSSaM2zmn2cbLOeCoomI2aEKkKlkUDPGzBXG-ESjxiZ1c8vPDY9Q6U/s400/fragmentation.png" width="148" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fragmented static descriptor heap region</td></tr>
</tbody></table>
In theory, we can defragment this heap region by moving descriptor groups to unused space (this works because we use indices to reference descriptors inside a heap instead of D3D12_GPU_DESCRIPTOR_HANDLE addresses directly), and during defragmentation, we may switch to dynamic descriptors temporarily to avoid overwriting the static heap region while GPU commands are still using it. But currently, I have not implemented the defragmentation yet because I only have one simple level (i.e. only 1 static descriptor group) now...<br />
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have described how the descriptor heaps are managed for resource binding. To sum up, the shader visible descriptor heap is divided into 2 regions: static/dynamic. The static region is managed in a "stack" based approach. During level loading, all the static CBV/SRV descriptors are stored within a static descriptor stack group, which is one big contiguous descriptor table; this increases the chance of reusing the descriptor table. In addition to this optional static descriptor, every resource must have a non shader visible descriptor handle. The non shader visible descriptor handle is used when a static descriptor table cannot be used during resource binding; it gets copied to the shader visible heap to form a new descriptor table. With this kind of heap management, we can create a resource binding API similar to D3D11 which calls the underlying D3D12 API using descriptors.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/resource-binding">https://docs.microsoft.com/en-us/windows/desktop/direct3d12/resource-binding</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://www.gamedev.net/forums/topic/686440-d3d12-descriptor-heap-strategies/">https://www.gamedev.net/forums/topic/686440-d3d12-descriptor-heap-strategies/</a></span><br />
<br />
<br /></div>
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-90302739463313536512019-06-22T17:38:00.001+08:002019-06-22T17:38:22.465+08:00D3D12 Root Signature Management<b> <span style="font-size: large;">Introduction</span></b><br />
Continuing from the <a href="https://simonstechblog.blogspot.com/2019/06/msbuild-custom-build-tools-notes.html">last post</a> about writing my new toy D3D12 graphics engine, we have compiled some shaders and extracted some reflection data from the shader source. The next problem is to bind resources (e.g. constant buffers / textures) to the shaders. D3D12 uses <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/root-signatures">root signatures</a> together with root parameters to achieve this task. In this post, I will describe how my toy engine creates root signatures automatically based on shader resource usage.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQkSrLifMzETziYHnqPu6ERbQ_zCtAhJCuUHzvQrP-bDu4jydefdihJKXqxmH2nBkls-C3ZX51xabEXvyl8YakhK_-XZANdTS1K_AkIqfw896PR5BJVRp1B8QhvZ4aFcxye8GB3kabS11m/s1600/d3d12_demo_shadowmap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQkSrLifMzETziYHnqPu6ERbQ_zCtAhJCuUHzvQrP-bDu4jydefdihJKXqxmH2nBkls-C3ZX51xabEXvyl8YakhK_-XZANdTS1K_AkIqfw896PR5BJVRp1B8QhvZ4aFcxye8GB3kabS11m/s640/d3d12_demo_shadowmap.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div style="text-align: left;">
Left: new D3D12 graphics engine (with only basic diffuse material)</div>
<div style="text-align: left;">
Right: previous D3D11 rendering (with PBR material, GI...)</div>
<div style="text-align: left;">
Still a long way to go to catch up with the previous renderer... </div>
</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Resource binding model</span></b><br />
In D3D12, shader resource binding relies on the <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/creating-a-root-signature">root parameter index</a>. But when iterating on shader code, we may modify some resource bindings (e.g. add a texture variable / remove a constant buffer); the root signature may then change, which changes the root parameter indices. Every call like <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setgraphicsrootdescriptortable">SetGraphicsRootDescriptorTable()</a> would need to be updated with the new root parameter index, which is tedious and error-prone... The resource binding model in D3D11 (e.g. <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-pssetshaderresources">PSSetShaderResources()</a>, <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-pssetconstantbuffers">PSSetConstantBuffers()</a>) doesn't have this problem, as the API defines a set of fixed slots to bind to. So I would prefer to work with a similar binding model in my toy engine.<br />
<br />
So, I defined a couple of slots for resource binding as follows (a bit different from D3D11):<br />
<blockquote class="tr_bq">
Engine_PerDraw_CBV<br />
Engine_PerView_CBV<br />
Engine_PerFrame_CBV<br />
Engine_PerDraw_SRV_VS_ONLY<br />
Engine_PerDraw_SRV_PS_ONLY<br />
Engine_PerDraw_SRV_ALL<br />
Engine_PerView_SRV_VS_ONLY<br />
Engine_PerView_SRV_PS_ONLY<br />
Engine_PerView_SRV_ALL<br />
Engine_PerFrame_SRV_VS_ONLY<br />
Engine_PerFrame_SRV_PS_ONLY<br />
Engine_PerFrame_SRV_ALL<br />
Shader_PerDraw_CBV<br />
Shader_PerDraw_SRV_VS_ONLY<br />
Shader_PerDraw_SRV_PS_ONLY<br />
Shader_PerDraw_SRV_ALL<br />
Shader_PerDraw_UAV</blockquote>
<div>
Instead of having fixed slots per shader stage as in D3D11, my toy engine's slots can be summarized into 3 categories:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiPJMzlLOVRUAhOe2APv0tFdaph41UQ6y3Sphv7U6srrQK10v-bhevHLz_7AdH-wg3CRUS7TlQljnad9f82F_WeCN2Vppa4u1GB8BZk0FNxUsFU5CK7Og2HOdUgHR-jLw26nL8F6AJdD9H/s1600/root_slot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="37" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiPJMzlLOVRUAhOe2APv0tFdaph41UQ6y3Sphv7U6srrQK10v-bhevHLz_7AdH-wg3CRUS7TlQljnad9f82F_WeCN2Vppa4u1GB8BZk0FNxUsFU5CK7Og2HOdUgHR-jLw26nL8F6AJdD9H/s320/root_slot.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Resource binding slot categories</td></tr>
</tbody></table>
<b><br /></b>
<b>Slot category "Resource Type"</b><br />
As described by its name (CBV/SRV/UAV), this slot binds the corresponding resource type: constant buffer view / shader resource view / unordered access view.<br />
The SRV type is further sub-divided into VS_ONLY / PS_ONLY / ALL sub-categories, which refer to the <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_shader_visibility">shader visibility</a>. According to <a href="https://developer.nvidia.com/dx12-dos-and-donts">Nvidia's Do's and Don'ts</a>, limiting the shader visibility can improve performance.<br />
For the CBV type, the shader visibility is deduced from the shader reflection data during root signature and PSO creation.<br />
<br />
<b>Slot category "Change frequency"</b><br />
Resources are encouraged to be bound based on their update frequency, so this slot category is divided into 3 types: Per Frame / Per View / Per Draw.<br />
For the Per Frame/View types, the root parameter type is a <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_root_parameter_type">descriptor table</a>,<br />
while the Per Draw CBV type uses a <a href="https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ns-d3d12-d3d12_root_descriptor">root descriptor</a>.<br />
The Per Draw SRV type still uses a descriptor table instead of root descriptors because, for example, it is common to have only 1 constant buffer for the material of a mesh while binding multiple textures for the same material; using a descriptor table here helps keep the root signature small.<br />
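As an illustration, here is a rough C++ sketch of how these categories can translate into D3D12 root parameters (the descriptor count and register space numbers are placeholders; the real values come from shader reflection and the slot definitions):<br />
<pre>
#include <d3d12.h>

void BuildExampleRootParameters()
{
    // Per Frame/View slot -> descriptor table of SRVs
    D3D12_DESCRIPTOR_RANGE srvRange = {};
    srvRange.RangeType          = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    srvRange.NumDescriptors     = 4;   // placeholder, deduced from shader reflection
    srvRange.BaseShaderRegister = 0;
    srvRange.RegisterSpace      = 8;   // placeholder, the space assigned to this slot

    D3D12_ROOT_PARAMETER perViewSrvTable = {};
    perViewSrvTable.ParameterType    = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    perViewSrvTable.DescriptorTable.NumDescriptorRanges = 1;
    perViewSrvTable.DescriptorTable.pDescriptorRanges   = &srvRange;
    perViewSrvTable.ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;   // a PS_ONLY slot

    // Per Draw CBV slot -> root descriptor
    D3D12_ROOT_PARAMETER perDrawCbv = {};
    perDrawCbv.ParameterType             = D3D12_ROOT_PARAMETER_TYPE_CBV;
    perDrawCbv.Descriptor.ShaderRegister = 0;
    perDrawCbv.Descriptor.RegisterSpace  = 16;  // placeholder
    perDrawCbv.ShaderVisibility          = D3D12_SHADER_VISIBILITY_ALL; // deduced from reflection

    // ... both would then go into D3D12_ROOT_SIGNATURE_DESC::pParameters
}
</pre>
<br />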
<br />
<b>Slot category "Usage"</b><br />
This category sub-divides slots into different usage patterns: Engine/Shader.<br />
Engine slots typically bind things like mesh transform constants, camera transforms, etc.<br />
Shader slots are used for shader specific stuff, e.g. material constants.<br />
I just couldn't find a more appropriate name for this category and simply went with Engine/Shader. Maybe it would be better to call them Group 0/1/2/3... in case I have different usage patterns in the future, but I won't bother with that for now...<br />
<br />
<div>
<div>
<b><span style="font-size: large;">Shader Reflection</span></b><br />
In the <a href="https://simonstechblog.blogspot.com/2019/06/msbuild-custom-build-tools-notes.html">last post</a>, I mentioned that shader reflection data is exported during shader compilation. This is important for root signature creation: from the reflection data, we know which constant buffer/texture slots are used. When creating a <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/managing-graphics-pipeline-state-in-direct3d-12">pipeline state object (PSO)</a> from shaders, we can deduce all the resource slots used in the PSO (as well as the shader visibility for constant buffers) and then create an appropriate root signature with each resource slot mapped to the corresponding root parameter index (let's call this mapping data the "root signature info").<br />
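As an illustration, here is a sketch of gathering the used slots from one shader stage via the D3D12 shader reflection interface (SlotUsage is a hypothetical bookkeeping struct; with dxc, the ID3D12ShaderReflection would be obtained from the compiled blob):<br />
<pre>
#include <d3d12shader.h>   // ID3D12ShaderReflection, D3D12_SHADER_INPUT_BIND_DESC

// Hypothetical struct recording which engine slot uses which register.
struct SlotUsage
{
    void MarkCBV(UINT space, UINT reg);
    void MarkSRV(UINT space, UINT reg);
    void MarkUAV(UINT space, UINT reg);
};

void GatherSlotUsage(ID3D12ShaderReflection* reflection, SlotUsage* usage)
{
    D3D12_SHADER_DESC shaderDesc = {};
    reflection->GetDesc(&shaderDesc);

    // Walk every resource bound by this shader and record its register space
    // (the space identifies the engine slot, see the HLSL example below).
    for (UINT i = 0; i < shaderDesc.BoundResources; ++i)
    {
        D3D12_SHADER_INPUT_BIND_DESC bind = {};
        reflection->GetResourceBindingDesc(i, &bind);

        switch (bind.Type)
        {
        case D3D_SIT_CBUFFER:     usage->MarkCBV(bind.Space, bind.BindPoint); break;
        case D3D_SIT_TEXTURE:
        case D3D_SIT_STRUCTURED:  usage->MarkSRV(bind.Space, bind.BindPoint); break;
        case D3D_SIT_UAV_RWTYPED: usage->MarkUAV(bind.Space, bind.BindPoint); break;
        default: break;
        }
    }
}
</pre>
<br />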
<br />
To specify the resource slot in shader code, we make use of the <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/resource-binding-in-hlsl">register space</a> introduced in Shader Model 5.1: the space defines which slot a constant buffer/texture belongs to. For example:<br />
<blockquote class="tr_bq">
#define ENGINE_PER_DRAW_SRV_ALL space5 // all shaders must have the same slot-space definition<br />
Texture2D shadowMap : register(t0, ENGINE_PER_DRAW_SRV_ALL);</blockquote>
With the above information, on the CPU side we can bind a resource to a specific slot using the root parameter index stored inside the "root signature info", similar to the D3D11 API.<br />
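For example, a hypothetical slot-based binding function built on top of the "root signature info" could look like this (names are illustrative, not the engine's real API):<br />
<pre>
#include <d3d12.h>

enum EngineSlot { ENGINE_PER_DRAW_SRV_PS_ONLY /*, ... the other 16 slots above */ };

struct RootSignatureInfo            // hypothetical layout of the "root signature info"
{
    UINT slotToRootParam[17];       // resource slot -> root parameter index
};

void SetPerDrawSRV_PS(ID3D12GraphicsCommandList* cmdList,
                      const RootSignatureInfo& info,
                      D3D12_GPU_DESCRIPTOR_HANDLE srvTable)
{
    // look up which root parameter index this slot got at PSO creation time
    cmdList->SetGraphicsRootDescriptorTable(
        info.slotToRootParam[ENGINE_PER_DRAW_SRV_PS_ONLY], srvTable);
}
</pre>
<br />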
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
<div>
In this post, we have described how a root signature can be automatically created and used behind a slot based API. First, the root signature is created (or re-used/shared) during the creation of a pipeline state object (PSO), based on its shader reflection data. We also create a "root signature info" that stores the mapping between resource slots and root parameter indices, kept together with the root signature and PSO. We can then use this "root signature info" to bind resources to the shaders.<br />
<br />
As this is my first time writing a graphics engine with D3D12, I am not sure whether this resource binding model is the best. I have also thought of another naming scheme for the resource slots: instead of naming them with PerDraw / PerView, would it be better to name them explicitly as RootDescriptor / DescriptorTable? Maybe I will change my mind after I gain more experience in the future...</div>
</div>
<div>
<br />
<b>Reference</b><br />
[1] <a href="https://docs.microsoft.com/en-us/windows/desktop/direct3d12/root-signatures">https://docs.microsoft.com/en-us/windows/desktop/direct3d12/root-signatures</a><br />
[2] <a href="https://developer.nvidia.com/dx12-dos-and-donts">https://developer.nvidia.com/dx12-dos-and-donts</a><br />
<br />
<br />
<br />
<br /></div>
</div>
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-72579384101748316902019-06-14T17:11:00.003+08:002019-06-22T12:45:52.372+08:00MSBuild custom build tools notes<b><span style="font-size: large;">Introduction</span></b><br />
Recently, I have been trying to re-write my graphics code to use D3D12 (instead of the D3D11 used in Seal Guardian), and I need a convenient way to compile shader code. While tidying up the <a href="https://docs.microsoft.com/en-us/cpp/build/understanding-custom-build-steps-and-build-events?view=vs-2019">MSBuild custom build steps</a> files used in Seal Guardian for my new toy graphics engine, I regretted not writing a blog post about custom MSBuild back then. As I remember, such information was hard to find at the time and I had to look at some of the CUDA custom build files to guess how it works. So this post is just my personal notes about custom MSBuild, and I don't guarantee all the information is 100% correct. I have uploaded an example project to compile shader files <a href="https://github.com/simon-yeunglm/MSBuild">here</a>. Interested readers may also check out this <a href="http://www.reedbeta.com/blog/custom-toolchain-with-msbuild/">excellent post</a> about MSBuild written by Nathan Reed.<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Custom build steps set up</span></b><br />
MSBuild needs a .targets file to describe how the compilers (e.g. dxc/fxc used for shader compilation) are invoked. In the uploaded <a href="https://github.com/simon-yeunglm/MSBuild">example project</a>, we have 3 main targets: DXC, JSON, BIN.<br />
<br />
- DXC target: as described by its name, invokes dxc.exe to compile HLSL files.<br />
- JSON target: invokes shaderCompiler.exe, our internal tool written using Flex & Bison, to parse the shader source code and output some meta data, like texture/constant buffer usage for root signature management.<br />
- BIN target: a task that depends on the DXC and JSON tasks; it invokes dataBuilder.exe, our internal tool for data serialization/deserialization into our binary format, combining the outputs from the DXC and JSON tasks.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy-4YwSqr5PiPhgq87GshQkzw_RoUf-KjJkOGbwG5bk14Z_BQKKJzn2zbkNT4vP0xoEF6R7MAUK3418RJfhjFzz_rQeI80M4PmVVd4grRgIO7Pn-rtLH9T3nmdaehicQh_-oxTnpMbhvDT/s1600/targets.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="328" data-original-width="577" height="181" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy-4YwSqr5PiPhgq87GshQkzw_RoUf-KjJkOGbwG5bk14Z_BQKKJzn2zbkNT4vP0xoEF6R7MAUK3418RJfhjFzz_rQeI80M4PmVVd4grRgIO7Pn-rtLH9T3nmdaehicQh_-oxTnpMbhvDT/s320/targets.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">target dependency</td></tr>
</tbody></table>
<br />
Although MSBuild can set up target dependencies, it looks like independent targets are not executed in parallel. In Seal Guardian, when compiling the surface shaders which generate the shader permutations for lighting, this results in a long compilation time. In the end, I had to create another exe that launches multiple threads to speed up shader compilation. Maybe I was setting up MSBuild incorrectly; if anyone knows how to parallelize it, please let me know in the comments below. Thank you!<br />
<br />
<b><span style="font-size: large;">Incremental Builds</span></b><br />
MSBuild uses .tlog files to track file modifications to avoid unnecessary compilation (they also affect which files get deleted when cleaning the project). There are 2 tlog files (read.1.tlog and write.1.tlog): one tracks whether the source files are modified, and the other tracks whether the output file is up to date. We can simply use the <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/writelinestofile-task?view=vs-2019">WriteLinesToFile</a> task to mark such a dependency, e.g.<br />
<br />
<div style="text-align: center;">
<WriteLinesToFile File="$(TLogLocation)$(ProjectName).write.1.tlog" Lines="$(TLog_writelines)" /></div>
<br />
But doing only this will make the tlog file grow larger and larger after every compilation. So it is better to read the tlog file content into a <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/propertygroup-element-msbuild?view=vs-2019">PropertyGroup</a> and check whether the file already contains the text we would like to write, using a <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/target-element-msbuild?view=vs-2019">"Condition"</a> inside the <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/writelinestofile-task?view=vs-2019">WriteLinesToFile</a> task. For details, please take a look at the <a href="https://github.com/simon-yeunglm/MSBuild">example project</a>.<br />
<br />
Also, as a side note, do not include the <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/msbuild-reserved-and-well-known-properties?view=vs-2019">$(MSBuildProjectFile)</a> property in the "Inputs" element inside the "Target" task. I did this accidentally and it caused the whole project to recompile all the shaders every time a shader file was added to / removed from the project. This is unnecessary as most of the shader files are independent.<br />
<br />
<b><span style="font-size: large;">Output files</span></b><br />
Like every Visual Studio project, our example project has Debug and Release configurations. After executing the BIN task described above, we also use a <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/copy-task?view=vs-2019">Copy task</a> to copy the compiled shaders from the Debug/Release <a href="https://docs.microsoft.com/en-us/cpp/build/reference/common-macros-for-build-commands-and-properties?view=vs-2019">$(OutDir)</a> directory to our content directory. We can also use the <a href="https://docs.microsoft.com/en-us/visualstudio/msbuild/property-functions?view=vs-2019#msbuild-makerelative">Property Function MakeRelative()</a> to maintain the directory hierarchy in the output directory. This is another reason why I use a Copy task instead of pointing $(OutDir) at the content directory: I cannot get a nested Property Function working inside the .props file (or maybe I did something wrong? I don't know...)...<br />
<br />
Also, besides output files, another note is about the output log. If we want to write something to the output console of Visual Studio from a custom exe (e.g. shaderCompiler.exe/dataBuilder.exe in the example project), the text must be in a specific format like the following (I cannot find documentation of the exact format; this is just guessed from similar messages emitted by Visual Studio...):<br />
<br />
<div style="text-align: center;">
1>C:/shaderFileName.hlsl(123): error : error message</div>
<br />
otherwise, the message will not get displayed in the output window.<br />
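For example, a custom C++ tool could emit such a line like this (a small sketch; the function name is just for illustration):<br />
<pre>
#include <cstdio>

// Emit a diagnostic in the format Visual Studio's output window parses:
//   file(line): severity : message
// Double-clicking the line in the output window then jumps to the location.
void ReportShaderError(const char* file, int line, const char* message)
{
    // e.g. "C:/shaderFileName.hlsl(123): error : error message"
    fprintf(stdout, "%s(%d): error : %s\n", file, line, message);
}
</pre>
<br />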
<br />
<br />
<b><span style="font-size: large;">Example project</span></b><br />
An example project is uploaded <a href="https://github.com/simon-yeunglm/MSBuild">here</a>. It compiles vertex/pixel shaders with dxc.exe, outputs JSON meta data to $(IntDir), then combines those data and writes them to $(OutDir). Finally, those files are copied to the asset directory with the corresponding relative path to the source directory. Please note that the shaderCompiler.exe used for outputting meta-data is an internal tool, which has some custom restrictions on the HLSL grammar for my convenience when creating root signatures. It is used just as an example to illustrate how to set up a custom MSBuild tool; feel free to replace/modify those files to suit your own needs. Thank you.<br />
<br />
<span style="font-size: xx-small;"><b>Reference</b><br /><span style="font-size: x-small;">[1] </span></span><a href="https://docs.microsoft.com/en-us/cpp/build/understanding-custom-build-steps-and-build-events?view=vs-2019">https://docs.microsoft.com/en-us/cpp/build/understanding-custom-build-steps-and-build-events?view=vs-2019</a><br />
<span style="font-size: x-small;">[2] </span><a href="http://www.reedbeta.com/blog/custom-toolchain-with-msbuild/">http://www.reedbeta.com/blog/custom-toolchain-with-msbuild/</a><br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-77209557183729231052018-10-07T18:04:00.002+08:002018-10-07T18:04:29.188+08:00Testing Hot-Reload DLL on Windows<b><span style="font-size: large;">Introduction</span></b><br />
After finishing the game <a href="https://store.steampowered.com/app/741620/Seal_Guardian/">Seal Guardian</a> and taking some rest, I recently got back to refactoring its engine code. In this game, the engine has the ability to hot-reload every asset type, from textures, shaders and game levels to Lua scripts, but it lacks the ability to hot-reload C/C++ files. So I decided to spend some time finding <a href="https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives">resources</a> about hot-reloading C/C++. It turns out hot-reloading C/C++ is not that trivial on Windows, as the PDB <a href="https://ourmachinery.com/post/dll-hot-reloading-in-theory-and-practice/">file</a> <a href="https://ourmachinery.com/post/little-machines-working-together-part-2/">is</a> <a href="https://blog.molecular-matters.com/2017/05/09/deleting-pdb-files-locked-by-visual-studio/">locked</a>. I found <a href="https://github.com/fungos/cr">this approach</a> of patching the PDB path inside the DLL interesting, so I gave it a try; the sample program is uploaded <a href="https://github.com/simon-yeunglm/HotReload">here</a> (only tested with Visual Studio Community 2017).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdqt_jmAgRvLSOsflonEo8kOyedFqjsMAlhLQWCS1Zs5x5etMBzqNsslp1PQc_UdiYweswtV0jBjM5o6NsfTMYmE97_GYSvhj0KP-FLfQBeHMDLFYObjBboKT3ekBMcktubsZzr-5CGV4w/s1600/hot_reload.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="800" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdqt_jmAgRvLSOsflonEo8kOyedFqjsMAlhLQWCS1Zs5x5etMBzqNsslp1PQc_UdiYweswtV0jBjM5o6NsfTMYmE97_GYSvhj0KP-FLfQBeHMDLFYObjBboKT3ekBMcktubsZzr-5CGV4w/s640/hot_reload.gif" width="640" /></a></div>
<br />
<span id="goog_433972302"></span><br />
<b><span style="font-size: large;">First try</span></b><br />
Because the PDB file path is hard coded inside the DLL file, the approach used by <a href="https://github.com/fungos/cr">cr.h</a> is to parse the DLL file to find the <a href="http://www.debuginfo.com/articles/debuginfomatch.html">PDB file path</a> and replace it with another new file path, following the <a href="https://msdn.microsoft.com/en-us/library/ms809762.aspx">Portable Executable format</a>.<br />
<br />
So I tried something similar, but unlike cr.h, which generates a new DLL/PDB file name every time the DLL gets re-compiled, I use a fixed temporary name (I don't want many random files inside the binary directory after several hot reloads...). For example, when Visual Studio generates the files:<br />
<ul>
<li>abc.dll</li>
<li>abc.pdb</li>
</ul>
the sample program will detect that abc.dll is updated and will generate 2 new files:<br />
<ul>
<li>ab_.dll</li>
<li>ab_.pdb</li>
</ul>
where ab_.dll has a patched PDB path pointing to the newly copied ab_.pdb, and the program loads ab_.dll instead.<br />
<br />
The reason I don't choose a more meaningful name like abc_tmp.dll is that I worry a file name longer than the original may mess up the offset values stored inside the DLL. So I just replace the last character with an underscore.<br />
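As an illustration, a naive C++ sketch of this same-length patch trick (a robust implementation should walk the PE debug directory as cr.h does, instead of a plain byte search):<br />
<pre>
#include <algorithm>
#include <string>
#include <vector>

// Find the embedded PDB path in the copied DLL bytes and overwrite it with
// the new path. Keeping both paths the same length keeps all the offsets
// stored inside the DLL valid.
bool PatchPdbPath(std::vector<char>& dllBytes,
                  const std::string& oldPath,   // e.g. "C:/bin/abc.pdb"
                  const std::string& newPath)   // e.g. "C:/bin/ab_.pdb"
{
    if (oldPath.size() != newPath.size())
        return false;

    auto it = std::search(dllBytes.begin(), dllBytes.end(),
                          oldPath.begin(), oldPath.end());
    if (it == dllBytes.end())
        return false;

    std::copy(newPath.begin(), newPath.end(), it);
    return true;
}
</pre>
<br />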
<br />
This approach works: every time I start without the debugger by pressing Ctrl+F5 in Visual Studio, then edit some code and re-build the solution by pressing F7, the DLL gets hot-reloaded. When the sample program exits, the ab_.dll and ab_.pdb files get deleted.<br />
<br />
However, when the program quits with a debugger attached, it can't delete the ab_.pdb file...<br />
<br />
<b><span style="font-size: large;">Second try</span></b><br />
We know that the Visual Studio debugger is locking the PDB file. What if, when we detect that a <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms680345(v=vs.85).aspx">debugger is attached</a>, we detach it programmatically before the program exits? Luckily the <a href="https://docs.microsoft.com/en-us/dotnet/api/envdte?view=visualstudiosdk-2017">EnvDTE COM library</a> can help with this task, and someone has written sample code to do <a href="https://handmade.network/forums/wip/t/1479-sample_code_to_programmatically_attach_visual_studio_to_a_process">this</a> (although that sample code says we need to change the "VisualStudio.DTE" string to the installed version like "VisualStudio.DTE.14.0", I have tested it with Visual Studio Community 2017 and it works without modification). So, by detaching the debugger programmatically, we can delete the temporary PDB file when the program exits.<br />
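A minimal C++ sketch of the detach call, assuming the EnvDTE type library import commonly used in automation samples (the linked sample additionally matches the correct DTE instance for the current process; this sketch just grabs the first one in the running object table):<br />
<pre>
#include <windows.h>
#include <oleauto.h>
#import "libid:80cc9f66-e7d8-4ddd-85b6-d9e6cd0e93e2" version("8.0") lcid("0") raw_interfaces_only named_guids
// ^ assumed EnvDTE type library id, as used in typical VS automation samples

void DetachDebugger()
{
    CoInitialize(nullptr);
    CLSID clsid;
    if (SUCCEEDED(CLSIDFromProgID(L"VisualStudio.DTE", &clsid)))
    {
        IUnknown* unk = nullptr;
        if (SUCCEEDED(GetActiveObject(clsid, nullptr, &unk)))
        {
            EnvDTE::_DTE* dte = nullptr;
            if (SUCCEEDED(unk->QueryInterface(&dte)))
            {
                EnvDTE::Debugger* debugger = nullptr;
                if (SUCCEEDED(dte->get_Debugger(&debugger)) && debugger)
                {
                    debugger->DetachAll();   // unlocks the PDB files
                    debugger->Release();
                }
                dte->Release();
            }
            unk->Release();
        }
    }
    CoUninitialize();
}
</pre>
<br />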
<br />
<b><span style="font-size: large;">Third try</span></b><br />
Now that we can detach the debugger programmatically, why not try re-attaching it after every hot reload? With the re-attach code written, I tried running the program by pressing F5 (Start Debugging) and then pressing F7 to re-compile the solution. A dialog popped up:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrtP1beoGcSp7wewauUjryHOYnI2gH7m-dyW6QOwPqlYo9nDbjGMDVCGFJem2_tLRg8xFkEXfHQ8wPvLFM1_AbHvLYEziCAXS2uqpAWP_DZ414Rn1HEUZuJMIRONdcjxQvPO06nLVhwfg/s1600/stop_debugging.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="152" data-original-width="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrtP1beoGcSp7wewauUjryHOYnI2gH7m-dyW6QOwPqlYo9nDbjGMDVCGFJem2_tLRg8xFkEXfHQ8wPvLFM1_AbHvLYEziCAXS2uqpAWP_DZ414Rn1HEUZuJMIRONdcjxQvPO06nLVhwfg/s1600/stop_debugging.png" /></a></div>
<br />
And I happily pressed 'Yes' hoping the hot-reload would work. Do you know what happened? The debugger stopped, but the application also quit... It looks like this approach can only work when using Ctrl+F5 (Start without debugger)... I searched the web for how to stop the debugger from killing the app when it stops, but I could only find people suggesting to detach the debugger instead. So I worked around this problem by detaching the debugger and re-attaching it during program start, to avoid the debugger killing the app when it stops.<br />
<br />
So, the hot-reload function is almost working now: just press F5 to start and F7+Enter to re-compile. But sometimes the debugger fails to re-attach to the reloaded app. After spending some time investigating the issue, it turns out the EnvDTE::Process::Item() function may fail to find the reloaded app process, returning the error code RPC_E_CALL_REJECTED. I don't know why this happens; maybe the process is busy reloading the new DLL. So the final workaround is to wait a bit, let the process finish its work, and re-try several times.<br />
<br />
<b><span style="font-size: large;">Fourth try</span></b><br />
We know that detaching the debugger will unlock the PDB. What if we just detach the debugger to unlock the PDB and only copy the newly compiled DLL, without patching a new PDB path? Unfortunately, it fails, saying that the .vcxproj file is locked...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg502VRDq3FuI-Brs37m_tLgxoukRU9XbLlvwkb4wf8SD8nTZMHQ-K8sCi_butgqcT6YEYSQXpRsic83MNOKuB5KoNbUvItvSL4DfiWfdZPuPMrsendUcC4zb67JVsTZFaqT8ns1ltCnxfT/s1600/lock_vcxproj.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="117" data-original-width="821" height="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg502VRDq3FuI-Brs37m_tLgxoukRU9XbLlvwkb4wf8SD8nTZMHQ-K8sCi_butgqcT6YEYSQXpRsic83MNOKuB5KoNbUvItvSL4DfiWfdZPuPMrsendUcC4zb67JVsTZFaqT8ns1ltCnxfT/s640/lock_vcxproj.png" width="640" /></a></div>
<br />
So I can only revert back to the "Third try" approach...<br />
<br />
<b><span style="font-size: large;">Last try</span></b><br />
We finally have a workable approach to reload the DLL; how about the executable itself? So I tried the <a href="https://docs.microsoft.com/en-us/visualstudio/debugger/edit-and-continue-visual-cpp?view=vs-2017">"edit and continue"</a> function in Visual Studio. And it works! But only once... because after edit and continue, stopping the debugger makes Visual Studio kill the app... And when manually detaching the debugger from Visual Studio, it fails with:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3X8hTbwR50g51lE_uGMGzCe8UnesNZwWWW7X8ZY-wWaok7vauRP9q_R7-yagmv190Qpsg8uVtRaLBDlQ6xucZPXWffswXRHjLRWGAtExHQKykI5FPkzJyKE1C5_kYFKCfF3fzWtuGkC71/s1600/detach_fail.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="159" data-original-width="406" height="124" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3X8hTbwR50g51lE_uGMGzCe8UnesNZwWWW7X8ZY-wWaok7vauRP9q_R7-yagmv190Qpsg8uVtRaLBDlQ6xucZPXWffswXRHjLRWGAtExHQKykI5FPkzJyKE1C5_kYFKCfF3fzWtuGkC71/s320/detach_fail.png" width="320" /></a></div>
<br />
So, "edit and continue" function does not compatible with my hot-reload method which relies on detaching the debugger...<br />
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, I have described the methods I tried when writing hot-reloadable DLL code on Windows. The steps are as follows:<br />
<br />
When the program loads a DLL:<br />
<blockquote class="tr_bq">
1. Copy its associated PDB file.<br />
2. Copy the target DLL file and patch its hard coded PDB path to the path of the PDB copied in step 1.<br />
3. Load the DLL copied in step 2 instead.</blockquote>
After editing some code:<br />
<blockquote class="tr_bq">
4. Detach the debugger so the DLL can be re-compiled from Visual Studio.<br />
5. Unload the copied DLL.<br />
6. Repeat the above steps 1 to 3.<br />
7. Re-attach the debugger.</blockquote>
From the programmer's perspective, the steps are:<br />
<blockquote class="tr_bq">
1. In Visual Studio, press F5 to compile and run the program with debugger.<br />
2. Edit some code, then press F7 to re-build the solution.<br />
3. Press enter to confirm the "Do you want to stop debugging?" dialog.<br />
4. The program will reload the new DLL and re-attach the debugger automatically after compilation.</blockquote>
You can try the above workflow by downloading the <a href="https://github.com/simon-yeunglm/HotReload">sample code</a>. I have only tested it with Visual Studio Community 2017 and it may not work with other versions of Visual Studio. This method is far from perfect; if anyone knows a better method that doesn't require workarounds, please let me know. Thank you very much!<br />
<br />
<b>Reference</b><br />
<span style="font-size: x-small;">[1] <a href="https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives">https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://ourmachinery.com/post/dll-hot-reloading-in-theory-and-practice/">https://ourmachinery.com/post/dll-hot-reloading-in-theory-and-practice/</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://ourmachinery.com/post/little-machines-working-together-part-2/">https://ourmachinery.com/post/little-machines-working-together-part-2/</a></span><br />
<span style="font-size: x-small;">[4] <a href="https://blog.molecular-matters.com/2017/05/09/deleting-pdb-files-locked-by-visual-studio/">https://blog.molecular-matters.com/2017/05/09/deleting-pdb-files-locked-by-visual-studio/</a></span><br />
<span style="font-size: x-small;">[5] <a href="https://github.com/fungos/cr">https://github.com/fungos/cr</a></span><br />
<span style="font-size: x-small;">[6] <a href="http://www.debuginfo.com/articles/debuginfomatch.html">http://www.debuginfo.com/articles/debuginfomatch.html</a></span><br />
<span style="font-size: x-small;">[7] <a href="https://msdn.microsoft.com/en-us/library/ms809762.aspx">https://msdn.microsoft.com/en-us/library/ms809762.aspx</a></span><br />
<span style="font-size: x-small;">[8] <a href="https://handmade.network/forums/wip/t/1479-sample_code_to_programmatically_attach_visual_studio_to_a_process">https://handmade.network/forums/wip/t/1479-sample_code_to_programmatically_attach_visual_studio_to_a_process</a></span><br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-17540395233239195822018-06-11T02:12:00.000+08:002018-06-11T02:12:33.349+08:00Simple GPU Path Tracer<span style="font-size: large;"><b>Introduction</b></span><br />
Path tracing has been getting more popular in recent years. And because the algorithm is easy to run in parallel, making the path tracer run on the GPU can greatly reduce the rendering time. This post is just my personal notes about learning the basics of path tracing and making myself familiar with the D3D12 API. The source code can be downloaded <a href="https://github.com/simon-yeunglm/PathTracer">here</a>. And for those who don't want to compile from source, the executable can be downloaded <a href="https://drive.google.com/file/d/1b3QiAn6mtunOHfad8NhGBcdfzH_VQxMv/view?usp=sharing">here</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBeft3XKyfmaH80f6dXhlp3yI8j9Nkj6Lxk4202heQtwpKleMCQOf1qYyiYpcqbDB7x03SojNovJPs5jWKjOlPrFH_oUc9W5ZS4M8PZUMmyAVws7sI_tLSQnAjwDuNn3fxcxMWzjQRPleD/s1600/traced_result.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBeft3XKyfmaH80f6dXhlp3yI8j9Nkj6Lxk4202heQtwpKleMCQOf1qYyiYpcqbDB7x03SojNovJPs5jWKjOlPrFH_oUc9W5ZS4M8PZUMmyAVws7sI_tLSQnAjwDuNn3fxcxMWzjQRPleD/s320/traced_result.png" width="320" /></a></div>
<br />
<b><span style="font-size: large;">Rendering Equation</span></b><br />
Like other rendering algorithms, path tracing solves the rendering equation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS1wBDV8krXLkY_K7jCPKEWA1wKqzMUVX4Sch1EAX1x2jg3hDKiLnkxR29UTM14MKnD5WTRnjq66wPMWp0hkd5oNAo1N3_HT1ZaHm2mMkeu84f524P8qGZyLnidcj7WCSTY3a-mBkq9e5w/s1600/render_eqt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="1186" height="43" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS1wBDV8krXLkY_K7jCPKEWA1wKqzMUVX4Sch1EAX1x2jg3hDKiLnkxR29UTM14MKnD5WTRnjq66wPMWp0hkd5oNAo1N3_HT1ZaHm2mMkeu84f524P8qGZyLnidcj7WCSTY3a-mBkq9e5w/s640/render_eqt.png" width="640" /></a></div>
<br />
To solve this integral, Monte Carlo integration can be used: we shoot many rays through each pixel from the camera position.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfsaQYBEsyss-LGHrfEUj3pGAjldxmsKYWQolQfeZV-2FNLavuiMvcT5ygkPBcftiNXSeHAvDJ1uvW1bFyE6c6bMjntXZWR8Q-Na7SUoAv0J2om6xAj_nYNgm2FkVqtZvAa8vE9KxLBLZI/s1600/render_eqt_monte_carlo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="140" data-original-width="1402" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfsaQYBEsyss-LGHrfEUj3pGAjldxmsKYWQolQfeZV-2FNLavuiMvcT5ygkPBcftiNXSeHAvDJ1uvW1bFyE6c6bMjntXZWR8Q-Na7SUoAv0J2om6xAj_nYNgm2FkVqtZvAa8vE9KxLBLZI/s640/render_eqt_monte_carlo.png" width="640" /></a></div>
<br />
During path tracing, when a ray hits a surface, we accumulate its light emission as well as the reflected light of that surface, i.e. we compute the rendering equation. But we take only one sample in the Monte Carlo integration, so only 1 random ray is generated according to the surface normal, which simplifies the equation to:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi04WIPcwM4LsUjKcoGkD6H6xwMabBf0zis-2Tz1gjjjwvDZN3Mfqh_D_kwMqMW9wyjYtqvUdJE9xPydgoOWkgLTanTdxBMVEc22AoGGkP6jStByF9vrso7prBzUQfVIISHcZlBscgaXSun/s1600/render_eqt_1_sample.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="146" data-original-width="533" height="87" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi04WIPcwM4LsUjKcoGkD6H6xwMabBf0zis-2Tz1gjjjwvDZN3Mfqh_D_kwMqMW9wyjYtqvUdJE9xPydgoOWkgLTanTdxBMVEc22AoGGkP6jStByF9vrso7prBzUQfVIISHcZlBscgaXSun/s320/render_eqt_1_sample.png" width="320" /></a></div>
<br />
Since we shoot many rays within a single pixel, we still get an un-biased result. Expanding the recursive path tracing rendering equation, we can derive the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimE9AwO_MW_GnIba53RVFvruduX4jDMGnBoEazNzSKV_YaBaNRKC0Z9dqGOoL_z_oFgyznSiOF2efT7h7naq8tEVN9NYxsC4JOb6XmKdSycilYZB8eVI7GMXsepAG6IuBk2sVbBCRNqg-L/s1600/render_eqt_expand.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="782" data-original-width="1102" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimE9AwO_MW_GnIba53RVFvruduX4jDMGnBoEazNzSKV_YaBaNRKC0Z9dqGOoL_z_oFgyznSiOF2efT7h7naq8tEVN9NYxsC4JOb6XmKdSycilYZB8eVI7GMXsepAG6IuBk2sVbBCRNqg-L/s400/render_eqt_expand.png" width="400" /></a></div>
<br />
<b style="font-size: x-large;">GPU random number</b><br />
To compute the Monte Carlo integration, we need to generate random numbers on the GPU. The <a href="http://reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/">wang_hash</a> is used here due to its simple implementation.<br />
<ol>
<li>uint wang_hash(uint seed)</li>
<li>{</li>
<li> seed = (seed ^ 61) ^ (seed >> 16);</li>
<li> seed *= 9;</li>
<li> seed = seed ^ (seed >> 4);</li>
<li> seed *= 0x27d4eb2d;</li>
<li> seed = seed ^ (seed >> 15);</li>
<li> return seed;</li>
<li>}</li>
</ol>
We use the pixel index as the input for the wang_hash function.<br />
<blockquote class="tr_bq">
seed = px_pos.y * viewportSize.x + px_pos.x</blockquote>
However, there is a visible pattern in the random noise texture generated this way (although it does not affect the final render result much...):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbrW-b5G0LDVDa2e99D-_1_15aGSmheRhY_QFVACmt4aq0MNkXh3SJqUhDi6ruTw4oUQC13VH8ULAPkCne9mrNWv-FTt20SU37JCymNqxThzpTbaz1x6jFxjHZu_DbaQ417iEYK-qnd5aC/s1600/noise_no_fix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbrW-b5G0LDVDa2e99D-_1_15aGSmheRhY_QFVACmt4aq0MNkXh3SJqUhDi6ruTw4oUQC13VH8ULAPkCne9mrNWv-FTt20SU37JCymNqxThzpTbaz1x6jFxjHZu_DbaQ417iEYK-qnd5aC/s320/noise_no_fix.png" width="320" /></a></div>
<br />
<br />
Luckily, to fix this, we can simply multiply the pixel index by a constant (100 in this case), which eliminates the visible pattern in the random texture.<br />
<blockquote class="tr_bq">
seed = (px_pos.y * viewportSize.x + px_pos.x) * 100 </blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmEPYx0BnyH_rpFqapvnM-UHZNgXcfvig5Cyqx3UvctW-OxJn7jc8DCE6joXRuyyR3eZC9QNV4OIs5DAYiDcavNzVd-E48QrYAefvLxtDYwK0Ck-P18CUpcItnjmm4UZFov4S2uXhJ498E/s1600/noise_with_fix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmEPYx0BnyH_rpFqapvnM-UHZNgXcfvig5Cyqx3UvctW-OxJn7jc8DCE6joXRuyyR3eZC9QNV4OIs5DAYiDcavNzVd-E48QrYAefvLxtDYwK0Ck-P18CUpcItnjmm4UZFov4S2uXhJ498E/s320/noise_with_fix.png" width="320" /></a></div>
<br />
To generate multiple random numbers within the same pixel, we add a constant to the random seed after each call to the wang_hash function. Any constant larger than 0 (e.g. 10) is good enough for this simple path tracer.<br />
<ol>
<li>float rand(inout uint seed)</li>
<li>{</li>
<li> float r= wang_hash(seed) * (1.0 / 4294967296.0);</li>
<li> seed+= 10;</li>
<li> return r;</li>
<li>}</li>
</ol>
<b><span style="font-size: large;">Scene Storage</span></b><br />
To trace rays on the GPU, I upload all the scene data (e.g. triangles, materials, lights...) into several structured buffers and constant buffers. Due to my laziness and the announcement of <a href="https://blogs.msdn.microsoft.com/directx/2018/03/19/announcing-microsoft-directx-raytracing/">DirectX Raytracing</a>, I did not implement any ray tracing acceleration structure like a BVH; I just store the triangles in one big buffer.<br />
<br />
<b><span style="font-size: large;">Tracing Rays</span></b><br />
Using the rendering equation derived above, we can start writing code to shoot rays from the camera. In each frame, for each pixel, we trace one ray and reflect it multiple times to compute the rendering equation. We can then additively blend the path traced results over multiple frames to get a progressive path tracer, using the following blend factor:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijNk87blMx4S0TKk0v2GRneX8TUL1bZVT3IraYsNYUaAoQRBg1r583JxeliKEUHjoppPVOECLcQILQzWjfYVyFh0Z66xt4vgVKxROJ3W234c5OfnqexyivIrd7pZRILiTRrvMWp3YDdRIY/s1600/blend_factor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="140" data-original-width="1403" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijNk87blMx4S0TKk0v2GRneX8TUL1bZVT3IraYsNYUaAoQRBg1r583JxeliKEUHjoppPVOECLcQILQzWjfYVyFh0Z66xt4vgVKxROJ3W234c5OfnqexyivIrd7pZRILiTRrvMWp3YDdRIY/s640/blend_factor.png" width="640" /></a></div>
<br />
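For reference, a minimal C++ sketch of this progressive accumulation, assuming the usual running average where frame N is blended with weight 1/(N+1) (the demo's exact blend factor is the one shown in the image above):<br />
<pre>
// Running-average accumulation, per color channel. Assumes frameIdx starts
// at 0, so frame N is blended over the history with weight 1/(N+1).
float accumulate(float history, float newSample, unsigned int frameIdx)
{
    float blend = 1.0f / float(frameIdx + 1);
    return history * (1.0f - blend) + newSample * blend;
}
</pre>
<br />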
To generate the random reflected direction at a surface the ray hits, we simply uniformly sample a direction on the hemisphere around the surface normal:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgezk3Bpd1AYWuVDYlbQXMoPKD63TVlPGePQrK-tkztb77iQb-JUNZbHwMsolxK7WArVmd5w75KXoyfruwtRMqybeL-GgxhcztSFdKPqwzEUUZU99lFsEy8O4r5NcRfsg0hxiUZhuKMAkJa/s1600/sample_uniform.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="393" data-original-width="1038" height="151" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgezk3Bpd1AYWuVDYlbQXMoPKD63TVlPGePQrK-tkztb77iQb-JUNZbHwMsolxK7WArVmd5w75KXoyfruwtRMqybeL-GgxhcztSFdKPqwzEUUZU99lFsEy8O4r5NcRfsg0hxiUZhuKMAkJa/s400/sample_uniform.png" width="400" /></a></div>
<br />
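As a concrete reference, here is a CPU-side C++ sketch of uniform hemisphere sampling (pdf = 1/2&pi;) around the surface normal, mirroring what the HLSL shader would do; the float3 helpers are minimal stand-ins for the HLSL intrinsics:<br />
<pre>
#include <cmath>

struct float3 { float x, y, z; };

static float3 mul(float3 v, float s)  { return { v.x*s, v.y*s, v.z*s }; }
static float3 add3(float3 a, float3 b, float3 c)
{ return { a.x+b.x+c.x, a.y+b.y+c.y, a.z+b.z+c.z }; }
static float3 cross(float3 a, float3 b)
{ return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float3 normalize(float3 v)
{ float l = sqrtf(v.x*v.x + v.y*v.y + v.z*v.z); return mul(v, 1.0f/l); }

// Uniformly sample a direction on the hemisphere around normal n
// (pdf = 1/(2*pi)); u1/u2 are two uniform random numbers in [0,1) from the
// rand() function shown earlier.
float3 sampleHemisphereUniform(float3 n, float u1, float u2)
{
    float cosTheta = u1;   // uniform in cos(theta) gives uniform solid angle
    float sinTheta = sqrtf(fmaxf(0.0f, 1.0f - cosTheta*cosTheta));
    float phi      = 6.28318530f * u2;
    // local direction with +z along the normal
    float3 local = { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
    // build an orthonormal basis around n and transform to world space
    float3 up = (fabsf(n.z) < 0.999f) ? float3{0,0,1} : float3{1,0,0};
    float3 t  = normalize(cross(up, n));
    float3 b  = cross(n, t);
    return add3(mul(t, local.x), mul(b, local.y), mul(n, local.z));
}
</pre>
<br />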
Here is the result of the path tracer when using uniform random directions and an emissive light material. The result is quite noisy:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9JWF1lzEGuCesOfXm66P2-zZC4CmaY85TUTi76tB9aIMDHJINK4-9DunXMRgOYdGs64zlS_XeSHS2Y93QBZ8t-bDw9hrKF_tInQiWuybawwCXGUss17otsjN_m8leM7ulrR4GLBg0317n/s1600/implicit_uniform_64.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9JWF1lzEGuCesOfXm66P2-zZC4CmaY85TUTi76tB9aIMDHJINK4-9DunXMRgOYdGs64zlS_XeSHS2Y93QBZ8t-bDw9hrKF_tInQiWuybawwCXGUss17otsjN_m8leM7ulrR4GLBg0317n/s320/implicit_uniform_64.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Uniform implicit light sampling, 64 sample per pixel</td></tr>
</tbody></table>
<br />
To reduce the noise, we can weight the randomly reflected rays with a cosine factor, similar to a Lambert diffuse surface:<br />
<table>
<tbody>
<tr>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkNyXpjrn2RMLgkSfCxkX002KUdioBYuwdnIp1ADamhbZ_DLsZCBUTakjraQISs8ObIFBibnpd2Vc3qPv2pQvt2O3IKPtPLcg_2BF22jfHloUB5kkayHgNJU7jG7lag9UAkrVkvAMQ08aK/s1600/sample_cos.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="398" data-original-width="1052" height="151" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkNyXpjrn2RMLgkSfCxkX002KUdioBYuwdnIp1ADamhbZ_DLsZCBUTakjraQISs8ObIFBibnpd2Vc3qPv2pQvt2O3IKPtPLcg_2BF22jfHloUB5kkayHgNJU7jG7lag9UAkrVkvAMQ08aK/s400/sample_cos.png" width="400" /></a></div>
<br /></td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVYV8Pls2dEcYsLESZRxndF4g7kwH85x-mvwuqmklqHWMSCMar-9VjsyKKahyphenhyphena-_v31yaIaJWvCSc-OEMMUfZdNUdAQ-hUGxaNtJrtrnFiOrY7uZuU1wo3_7AQRgcg_BXETO0KjWk8PK1L/s1600/implicit_cos_64.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVYV8Pls2dEcYsLESZRxndF4g7kwH85x-mvwuqmklqHWMSCMar-9VjsyKKahyphenhyphena-_v31yaIaJWvCSc-OEMMUfZdNUdAQ-hUGxaNtJrtrnFiOrY7uZuU1wo3_7AQRgcg_BXETO0KjWk8PK1L/s320/implicit_cos_64.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cos weighted implicit light sampling, 64 sample per pixel</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
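Assuming the cosine weighting is realized as cosine-weighted hemisphere sampling (pdf = cos&theta;/&pi;, so the pdf cancels the cosine term of the rendering equation), only the &theta; distribution changes from the uniform sketch above (float3 helpers reused from there):<br />
<pre>
// Cosine-weighted hemisphere sampling (pdf = cos(theta)/pi).
float3 sampleHemisphereCosine(float3 n, float u1, float u2)
{
    float cosTheta = sqrtf(u1);                       // instead of cosTheta = u1
    float sinTheta = sqrtf(fmaxf(0.0f, 1.0f - u1));
    float phi      = 6.28318530f * u2;
    float3 local = { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
    float3 up = (fabsf(n.z) < 0.999f) ? float3{0,0,1} : float3{1,0,0};
    float3 t  = normalize(cross(up, n));
    float3 b  = cross(n, t);
    return add3(mul(t, local.x), mul(b, local.y), mul(n, local.z));
}
</pre>
<br />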
The result is still a bit noisy. Because the light source in our scene is not very large, the probability of a randomly reflected ray hitting the light source is quite low. To improve this, we can explicitly sample the light source for every ray that hits a surface.<br />
<br />
To sample a rectangular light source, we can randomly choose a point over its surface area, and the corresponding probability density will be:<br />
<blockquote class="tr_bq">
1/area of light</blockquote>
Since our light sampling is over the area domain instead of the direction domain as stated in the above equation, the rendering equation needs to be multiplied by the Jacobian that relates solid angle to area, i.e.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghDX4ndW7T21UnrZNAZZ0ItbBw73TGWpIsNlapHishSW6xTLrhVvf5Pxs5hyZ53ju4DiUAT-xTwGtwhyXEZKybNkSG1_HmjNIxLCDRiyNH8QIt7s0mwXGQ8Ce0m7DXs5qR0h_tc6im1Emt/s1600/jacobian.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="242" data-original-width="1116" height="137" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghDX4ndW7T21UnrZNAZZ0ItbBw73TGWpIsNlapHishSW6xTLrhVvf5Pxs5hyZ53ju4DiUAT-xTwGtwhyXEZKybNkSG1_HmjNIxLCDRiyNH8QIt7s0mwXGQ8Ce0m7DXs5qR0h_tc6im1Emt/s640/jacobian.png" width="640" /></a></div>
<br />
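Here is a sketch of this explicit light sampling for a rectangular light, converting the area pdf (1/area) into a solid-angle pdf with the Jacobian above; the LightQuad struct is illustrative and the float3 helpers are reused from the sampling sketch earlier:<br />
<pre>
// Illustrative rectangular light: corner plus two edge vectors.
struct LightQuad { float3 corner, edgeU, edgeV, normal; float area; };

// Returns the direction from shadingPos to the light sample and writes the
// solid-angle pdf: pdf_w = dist^2 / (cos(theta_l) * area).
float3 sampleAreaLight(const LightQuad& L, float3 shadingPos,
                       float u1, float u2, float& pdfSolidAngle)
{
    // pick a uniform point on the rectangle: corner + u1*edgeU + u2*edgeV
    float3 p = add3(L.corner, mul(L.edgeU, u1), mul(L.edgeV, u2));
    float3 d = { p.x - shadingPos.x, p.y - shadingPos.y, p.z - shadingPos.z };
    float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;
    float3 wi = mul(d, 1.0f / sqrtf(dist2));
    // cos(theta_l): angle between the light normal and the direction back to the surface
    float cosL = fmaxf(0.0f, -(wi.x*L.normal.x + wi.y*L.normal.y + wi.z*L.normal.z));
    pdfSolidAngle = (cosL > 0.0f) ? dist2 / (cosL * L.area) : 0.0f;
    return wi;
}
</pre>
<br />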
With the same number of samples per pixel, the result is much less noisy:<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTK8MPk550xNSBMkZTHOi5ZFf87r99GTtKxqDLqaa2RNJfLEgH2cp5RdgXD22-0SWDW4NJsj2vpbcdAHb9dioSLnvXtAz7ZzWMclHYSeE8zJmioAvcO2IxvOTd4wI05ZxikZki2lJ92xkm/s1600/explicit_uniform_64.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTK8MPk550xNSBMkZTHOi5ZFf87r99GTtKxqDLqaa2RNJfLEgH2cp5RdgXD22-0SWDW4NJsj2vpbcdAHb9dioSLnvXtAz7ZzWMclHYSeE8zJmioAvcO2IxvOTd4wI05ZxikZki2lJ92xkm/s320/explicit_uniform_64.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8px;">Uniform explicit light sampling, 64 sample per pixel</span></td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDltvhMRd2bIVePO4KUZcXTS6OoU4o6Pf2Z0zT0j5t6Gullgh_4N8u0dn6Krj8Tzbi1vTufBqAywVvVfNkONRT1DDTQxt1SAFFAWkaZnjTk8uU-84vPI4DjECLWZBqx_luMG1_k6-V8agM/s1600/explicit_cos_64.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDltvhMRd2bIVePO4KUZcXTS6OoU4o6Pf2Z0zT0j5t6Gullgh_4N8u0dn6Krj8Tzbi1vTufBqAywVvVfNkONRT1DDTQxt1SAFFAWkaZnjTk8uU-84vPI4DjECLWZBqx_luMG1_k6-V8agM/s320/explicit_cos_64.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8px;">Cos weighted explicit light sampling, 64 sample per pixel</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
<span style="font-size: large;"><b><br />Simple de-noise</b></span><br />
As we have seen above, the path traced result is a bit noisy even with 64 samples per pixel. The result is even worse for the first frame:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFIYmvjRMoACawn8INwDc8POpsVP2Y8jd4SwxZwSMrZKJz8vPlEZJ_U9iZ8EQFRulybOhwKdAFP89dfIt0lF6JldkP_2x2Igj0vYuVSU0JbZRWw-ixoLMjyHyKFcbfp0nbjKjQgTHTYcSL/s1600/traced_result_1_frame.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFIYmvjRMoACawn8INwDc8POpsVP2Y8jd4SwxZwSMrZKJz8vPlEZJ_U9iZ8EQFRulybOhwKdAFP89dfIt0lF6JldkP_2x2Igj0vYuVSU0JbZRWw-ixoLMjyHyKFcbfp0nbjKjQgTHTYcSL/s320/traced_result_1_frame.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">first frame path traced result</td></tr>
</tbody></table>
There are some very bright dots, and it looks bad during camera motion. So I added a simple de-noise pass, which just blurs over a lot of pixels that are located on the same surface (it really needs a lot of pixels to make the result look good, which costs some performance...).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW5tfdkoLXNNYlU5PukSid7SQU3lQGvWQ2_Z89joEqg6P8rg8pM3JAjQ17MOsscTfqPW5DoTQEP6UpfEjXvLtzYBzv86hsOdCnudDVduuB3UM0D6iqfvTap8OUmwG9iDY9Bx2b9N90kLkO/s1600/traced_result_1_frame_blur.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW5tfdkoLXNNYlU5PukSid7SQU3lQGvWQ2_Z89joEqg6P8rg8pM3JAjQ17MOsscTfqPW5DoTQEP6UpfEjXvLtzYBzv86hsOdCnudDVduuB3UM0D6iqfvTap8OUmwG9iDY9Bx2b9N90kLkO/s320/traced_result_1_frame_blur.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blurred first frame path traced result</td></tr>
</tbody></table>
To identify which surface a pixel belongs to, we store this data in the alpha channel of the path tracing texture using the following formula:<br />
<blockquote class="tr_bq">
dot(surface_normal, float3(1, 10, 100)) + (mesh_idx + 1) * 1000</blockquote>
This works because the scene only contains a small number of meshes and the mesh normals are the same across each surface in this simple scene.<br />
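For illustration, the surface id computation and the kind of comparison a de-noise blur tap could use (reusing the float3 struct from the sampling sketches; the tolerance value is a hypothetical choice, not taken from the demo):<br />
<pre>
// Same formula as above: pack the surface normal and mesh index into one float.
float surfaceId(float3 n, int meshIdx)
{
    return (n.x * 1.0f + n.y * 10.0f + n.z * 100.0f) + float(meshIdx + 1) * 1000.0f;
}

// A blur tap only contributes when it lies on the same surface as the center pixel.
bool sameSurface(float idCenter, float idTap)
{
    const float eps = 0.5f;   // hypothetical tolerance
    return fabsf(idCenter - idTap) < eps;
}
</pre>
<br />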
<br />
<b><span style="font-size: large;">Random Notes...</span></b><br />
During the implementation, I encountered various bugs/artifacts which I think are interesting.<br />
<br />
The first one is about the simple de-noise pass: it may bleed the light source color to neighboring pixels far away, even though we have per pixel mesh index data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibsJN4VQtVX584XAEVYp-f-ARPGvPxJJ6TgjK6I-JiC5gdDhdfhFZN4VeKwG5_RDLvotoJf7rLg9QkzYEVwt1IgApKgAlMDT-QIcz-fG2ino9WH7XFbAoJmpewJJtzs61V43ilgWO1JZQE/s1600/blur_bug.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibsJN4VQtVX584XAEVYp-f-ARPGvPxJJ6TgjK6I-JiC5gdDhdfhFZN4VeKwG5_RDLvotoJf7rLg9QkzYEVwt1IgApKgAlMDT-QIcz-fG2ino9WH7XFbAoJmpewJJtzs61V43ilgWO1JZQE/s320/blur_bug.png" width="320" /></a></div>
<br />
This is because we only store a single mesh index per pixel, but we jitter the ray shot from the camera within a single pixel every frame, so some of the light color gets blended onto the light geometry edges. It becomes very noticeable because the light source has a much higher radiance than the light reflected from the ceiling geometry.<br />
<br />
To fix this, I simply do not jitter the rays that directly hit the light geometry from the camera, so this fix can only be applied to explicit light sampling.<br />
<br />
<br />
<br />
The second one is about quantization when using a 16-bit floating point texture. The path tracing texture sometimes shows quantized results after several hundred frames of additive blending, when the single sample per pixel path traced result is very noisy.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTCNt8G7kQjOCMPfIOI3HQTGroZ9DTHZKOHZQ_ocTYoM1uLeEnHl-QnDK6ePApOT6-rXiBaC8S6LkTe2WtgdZ-oC1tyCaPYpPL0gq9M71zamfPQvsexiPDK1im4EGN9L7H4oHnx42tE8Zo/s1600/quantized.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTCNt8G7kQjOCMPfIOI3HQTGroZ9DTHZKOHZQ_ocTYoM1uLeEnHl-QnDK6ePApOT6-rXiBaC8S6LkTe2WtgdZ-oC1tyCaPYpPL0gq9M71zamfPQvsexiPDK1im4EGN9L7H4oHnx42tE8Zo/s200/quantized.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Quantized implicit light sampling</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUG-ZNjAh7NJoWrh9OXZBF5J3BUJK_QpDLFgzrXZB9WxFBtCl8FSDRRDqO46lZYiPYWxIMkaEV-UBbIXYzf0A9luvhR90Z5AkiKErLDVtuSZ1_oRWjmfiy697fR327QXkvvox-N5f5Qp5V/s1600/quantized_1_frame.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUG-ZNjAh7NJoWrh9OXZBF5J3BUJK_QpDLFgzrXZB9WxFBtCl8FSDRRDqO46lZYiPYWxIMkaEV-UBbIXYzf0A9luvhR90Z5AkiKErLDVtuSZ1_oRWjmfiy697fR327QXkvvox-N5f5Qp5V/s200/quantized_1_frame.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Path traced result in first frame</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrRX-a0cKiEjA-xc_zPp3eh3S9wKOcx2-PKuzjcFp-pZrYv02ts1PCGwun_as1tYpZCtJyDIp-0j8j0p2d_5-V8KcTRdrmnMIljkTfr87fHIphuLAd-IktnACt9BUhiAHsM4OIkTFqVS1_/s1600/quantized_1_frame_blur.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrRX-a0cKiEjA-xc_zPp3eh3S9wKOcx2-PKuzjcFp-pZrYv02ts1PCGwun_as1tYpZCtJyDIp-0j8j0p2d_5-V8KcTRdrmnMIljkTfr87fHIphuLAd-IktnACt9BUhiAHsM4OIkTFqVS1_/s200/quantized_1_frame_blur.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">simple de-noised first frame result</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
To work around this, a 32-bit floating point texture needs to be used, but this has a performance impact (especially for my simple de-noise pass...).<br />
<br />
<br />
<br />
The last one is the bright fireflies artifact when using a very large light source (as big as the ceiling). This may sound counter-intuitive, and the implicit light path traced result (i.e. not sampling the light source directly) does not have those fireflies...</div>
<div>
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmyDmM5oxz-944waoK0BN5pwytDOq8miemft9-unILpmkjxCxiZ24Tn5nfUYNFZ4RBzliAWSQ1GJcuHONti5p0DnOnHCXjyPYhn07tX_c6QhnrtCyfGArHraE9F8-lYkDfCDmX7uaiS5te/s1600/large_light_explicit.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmyDmM5oxz-944waoK0BN5pwytDOq8miemft9-unILpmkjxCxiZ24Tn5nfUYNFZ4RBzliAWSQ1GJcuHONti5p0DnOnHCXjyPYhn07tX_c6QhnrtCyfGArHraE9F8-lYkDfCDmX7uaiS5te/s320/large_light_explicit.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Explicit light sample result</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbLhXUW2SJN1expKF4taftIWeEGl9tW29UR-y1sNAM2l79jwedMji7BKLKVWFRrV6semmMBReCSDa0ThrlY4WxtK1ieKSTfyDquxRLG9k2Gv2amjx5rLFg_r28nrwQv92F_2rB1dUhg3d1/s1600/large_light_implicit.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbLhXUW2SJN1expKF4taftIWeEGl9tW29UR-y1sNAM2l79jwedMji7BKLKVWFRrV6semmMBReCSDa0ThrlY4WxtK1ieKSTfyDquxRLG9k2Gv2amjx5rLFg_r28nrwQv92F_2rB1dUhg3d1/s320/large_light_implicit.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Implicit light sample result</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
But it turns out this artifact is not related to the size of the light source; it is caused by the light being too close to the reflecting geometry. To visualize it, we may look at how the light gets bounced:<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw_MLBGeU0oDGWrsBvdbSSK0cQRjc4EXMzdc3MDKWjCIxRe4KxB0HvXWE87S30UVlx3cClH2LBYc5HqZw0TbetCps0KyBLVMumlONDl39uOhhjMKbMhOPgXlUbhqlA8KIX6PQD_CC29Jz2/s1600/depth_1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw_MLBGeU0oDGWrsBvdbSSK0cQRjc4EXMzdc3MDKWjCIxRe4KxB0HvXWE87S30UVlx3cClH2LBYc5HqZw0TbetCps0KyBLVMumlONDl39uOhhjMKbMhOPgXlUbhqlA8KIX6PQD_CC29Jz2/s320/depth_1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">path trace depth = 1</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC8h4zj1bzTn5bqNfeCn0r4LzVp5JFeLtm2bTw-pCZtnSZ8PDSOafgJ6pgohxyZ9FuTaV5pm0y_jRjdDlq-J-kaTqOYzSWLuerMLzRNZpWxE-DjIys40f45uHwH2GIeCGGhFK245gnjIoU/s1600/depth_2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC8h4zj1bzTn5bqNfeCn0r4LzVp5JFeLtm2bTw-pCZtnSZ8PDSOafgJ6pgohxyZ9FuTaV5pm0y_jRjdDlq-J-kaTqOYzSWLuerMLzRNZpWxE-DjIys40f45uHwH2GIeCGGhFK245gnjIoU/s320/depth_2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">path trace depth = 2</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
The fireflies start to appear in the first bounce, located near the light source, and then get propagated with the reflected light rays. Those large values are generated by the Jacobian of the explicit light sampling transform, whose denominator is the squared distance between the light and the surface.<br />
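<br />
To make those large values concrete, below is a minimal, self-contained C++ sketch of one explicit light sample (next event estimation), assuming a Lambertian surface, uniform area sampling on the light and an unoccluded light point; all names are illustrative, not taken from the actual path tracer code:<br />
<pre>
#include &lt;cmath&gt;

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 mul(V3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

const float kPi = 3.14159265f;

// One explicit light sample: 'lightPos' is a uniformly picked point on the
// light with normal 'lightNrm' and total area 'lightArea'. Visibility
// (the shadow ray) is assumed to have passed already.
V3 explicitLightSample(V3 shadePos, V3 shadeNrm, V3 albedo,
                       V3 lightPos, V3 lightNrm, V3 emission, float lightArea)
{
    V3 toLight = sub(lightPos, shadePos);
    float dist2 = dot(toLight, toLight);            // squared distance
    V3 wi = mul(toLight, 1.0f / std::sqrt(dist2));
    float cosSurf  = std::fmax(dot(shadeNrm, wi), 0.0f);
    float cosLight = std::fmax(dot(lightNrm, mul(wi, -1.0f)), 0.0f);
    // Jacobian of the area-to-solid-angle change of variables: when the
    // sampled light point is very close to the surface, dist2 goes to 0
    // and G explodes, producing a firefly.
    float G = cosSurf * cosLight / dist2;
    // Lambertian BRDF = albedo / pi; pdf of the sample = 1 / lightArea.
    V3 f = {albedo.x * emission.x, albedo.y * emission.y, albedo.z * emission.z};
    return mul(f, (1.0f / kPi) * G * lightArea);
}
</pre>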
<br />
After a brief search on the internet, fixing this requires either radiance clamping, bi-directional path tracing, or a greatly increased sample count. Here is the result with over 75000 samples per pixel, which still contains some fireflies (a small sketch of the clamping follows the image)...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbQuxPfJOcasZgN21irDfPubd07zqtjlLUT5R_xLwzObBgD-BMtZoL6RZ-59fcDUB1Kmm1BXLAJOS4jtKqRYuUs9-eReIokNhfjFsdGylMO3rkPOorJ93JVoXXfFdWnu9JYgyw6z0xQoFj/s1600/large_light_explicit_75000.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbQuxPfJOcasZgN21irDfPubd07zqtjlLUT5R_xLwzObBgD-BMtZoL6RZ-59fcDUB1Kmm1BXLAJOS4jtKqRYuUs9-eReIokNhfjFsdGylMO3rkPOorJ93JVoXXfFdWnu9JYgyw6z0xQoFj/s320/large_light_explicit_75000.png" width="320" /></a></div>
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
In this post, we discussed the steps to implement a simple GPU path tracer. The most basic path tracer simply shoots a large number of rays per pixel and reflects each ray multiple times until it hits a light source. With explicit light sampling, we can greatly reduce noise.<br />
<br />
This path tracer is just my personal toy project, which only has Lambert diffuse reflection with a single light. It is my first time using the D3D12 API and the code is not well optimized, so the source code is for reference only; if you find any bugs, please let me know. Thank you.<br />
<br />
<b>Reference</b><br />
<span style="font-size: x-small;">[1] Physically Based Rendering <a href="http://www.pbrt.org/">http://www.pbrt.org/</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://www.slideshare.net/jeannekamikaze/introduction-to-path-tracing">https://www.slideshare.net/jeannekamikaze/introduction-to-path-tracing</a></span><br />
<span style="font-size: x-small;">[3] <a href="https://www.slideshare.net/takahiroharada/introduction-to-bidirectional-path-tracing-bdpt-implementation-using-opencl-cedec-2015">https://www.slideshare.net/takahiroharada/introduction-to-bidirectional-path-tracing-bdpt-implementation-using-opencl-cedec-2015</a></span><br />
<span style="font-size: x-small;">[4] <a href="http://reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/">http://reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/</a></span><br />
<br />
</div>
<span style="font-size: x-large;"><b>Render Passes in "Seal Guardian"</b></span><br />
<span style="font-size: large;"><b>Introduction</b></span><br />
<a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a> uses a forward renderer to render the scene. Because we need to support mobile platform, we don't have too many effect in it. But still it consists of a few render passes to compose an image.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb0XQ6v68gZKnabI2DYOX5EOZjowkeMeuwLjd6FKB5wbpJ9X2AkMElEHdoUWptVC_QOOI_VEcKvzQiRtj8ciXKH8qrBJBTupAJj-T2-gSvS0RP6KXRdU6XvKIPjUQFPH9RnBCP20jU-gZX/s1600/main_game.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="782" data-original-width="1336" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb0XQ6v68gZKnabI2DYOX5EOZjowkeMeuwLjd6FKB5wbpJ9X2AkMElEHdoUWptVC_QOOI_VEcKvzQiRtj8ciXKH8qrBJBTupAJj-T2-gSvS0RP6KXRdU6XvKIPjUQFPH9RnBCP20jU-gZX/s400/main_game.png" width="400" /></a></div>
<br />
<span style="font-size: large;"><b>Shadow Map Pass</b></span><br />
To calculate the dynamic shadow of the scene, we need to render the depth of the meshes from the light's point of view. We render them into a 1024x1024 shadow map.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpXW2c_Rw4OeSGpzj8oUHDeAF-B2lYoA-fWU84Rb-20tHe4CM8SN8bTcqo2NdqnZGKR0RRUHGb3WTbtOwBNJQksWSxq8I8g9ONogfhQVBq4yRstr9LV2H7-O6QP6EUqxuNpWkuDOU9V9f_/s1600/pass_0_shadow_map.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpXW2c_Rw4OeSGpzj8oUHDeAF-B2lYoA-fWU84Rb-20tHe4CM8SN8bTcqo2NdqnZGKR0RRUHGb3WTbtOwBNJQksWSxq8I8g9ONogfhQVBq4yRstr9LV2H7-O6QP6EUqxuNpWkuDOU9V9f_/s400/pass_0_shadow_map.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Standard shadow map</td></tr>
</tbody></table>
<br />
Then we use the Exponential Shadow Map (ESM) method: the depth is exponentially warped and blurred into a 512x512 shadow map.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMi6G1k4xesp-k8gWO0h8GFuecBmJhIhoXYQIP318Ke0QVH3GPGZmXrA3wTfSY4C0yh9hda-s_UMdfB_gQMDptISe9mVITTWueZ89Fh34nGQSxXTf4AEJh1ak1V2KGAeu9aEI7ve-1dxQ1/s1600/pass_0_ESM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMi6G1k4xesp-k8gWO0h8GFuecBmJhIhoXYQIP318Ke0QVH3GPGZmXrA3wTfSY4C0yh9hda-s_UMdfB_gQMDptISe9mVITTWueZ89Fh34nGQSxXTf4AEJh1ak1V2KGAeu9aEI7ve-1dxQ1/s400/pass_0_ESM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ESM blurred shadow map</td></tr>
</tbody></table>
<br />
(Note that this pass may be skipped according to the current performance setting. A small sketch of the ESM math follows.)<br />
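<br />
For reference, the idea of ESM is to store an exponentially warped depth exp(c*d) in the shadow map so that the map can be blurred and filtered linearly; the shadow test then becomes a single multiply. A minimal scalar sketch (the constant c is a hand-tuned trade-off between softness and light bleeding):<br />
<pre>
#include &lt;algorithm&gt;
#include &lt;cmath&gt;

// Written into the shadow map (and then blurred, e.g. while
// downsampling 1024x1024 to 512x512): the warped occluder depth.
float esmStore(float occluderDepth, float c)
{
    return std::exp(c * occluderDepth);
}

// Shadow test at shading time: exp(c*d_occluder) * exp(-c*d_receiver).
// Values of 1 or above mean fully lit; smaller values fall off softly.
float esmShadow(float blurredExpDepth, float receiverDepth, float c)
{
    return std::min(blurredExpDepth * std::exp(-c * receiverDepth), 1.0f);
}
</pre>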
<br />
<span style="font-size: large;"><b>Opaque Geometry Pass</b></span><br />
In this pass, we render the scene meshes into an RGBA8 render target. We compute all the lighting, including direct lighting, indirect lighting (light map or SH probe) and tone mapping, in this single pass. This is because on iOS fewer render passes may give better performance, so we choose to combine all the calculations into a single pass.<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3EmcTtIVIlN7QhRaFTBBG5XJpk8WO4giwcqR3L6GPlRYJG0NxHovfO6QgVXtcygN2xEJJ45pLCzk3TRcDJqKBlJOw1EFuUO4UgCOgopBVn56MkCYYE9dV0CSM0LZh2E2HVnjimPIh_O_e/s1600/pass_1_opaque_1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3EmcTtIVIlN7QhRaFTBBG5XJpk8WO4giwcqR3L6GPlRYJG0NxHovfO6QgVXtcygN2xEJJ45pLCzk3TRcDJqKBlJOw1EFuUO4UgCOgopBVn56MkCYYE9dV0CSM0LZh2E2HVnjimPIh_O_e/s320/pass_1_opaque_1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Tonemapped opaque scene color</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgj6oaht1xVp-6NqXFSXgCO4IQ4fdM-bmZSrK_od6_v8vfQJID1er3-iFoYR-d58Kxem9tn4uoUH9Ei8OGTCghEkF2Gvm1LaDkGG8Favco5QeG_2kqQ49DadvRkIKcoz8ypsZ2__BCjZ4aV/s1600/pass_1_opaque_1_depth.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgj6oaht1xVp-6NqXFSXgCO4IQ4fdM-bmZSrK_od6_v8vfQJID1er3-iFoYR-d58Kxem9tn4uoUH9Ei8OGTCghEkF2Gvm1LaDkGG8Favco5QeG_2kqQ49DadvRkIKcoz8ypsZ2__BCjZ4aV/s320/pass_1_opaque_1_depth.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Opaque geometry depth bufer</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
To reduce the impact of overdraw, we pre-compute a visibility set to avoid drawing occluded meshes (we may talk about it in a future post). Also, since we want to add a bloom pass to enhance the effect of bright pixels, we compute a bloom value in this pass according to the pre-tone-mapped value and store it in the alpha channel.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggg5le8UPS0e3McrreCF8t967SMGkvv8lb2u5ml5oOYFm1YUW78BCW5fInEeXqppGmUaRQNkSSmHpZ8ejJJ52RZXTMU85ZOqTOuV7BEsSJWhk0cVVmhh9Yc3tYiIifC3KxKBaue3NMtXc7/s1600/gbuffer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="126" data-original-width="576" height="69" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggg5le8UPS0e3McrreCF8t967SMGkvv8lb2u5ml5oOYFm1YUW78BCW5fInEeXqppGmUaRQNkSSmHpZ8ejJJ52RZXTMU85ZOqTOuV7BEsSJWhk0cVVmhh9Yc3tYiIifC3KxKBaue3NMtXc7/s320/gbuffer.png" width="320" /></a></div>
<br />
<span style="font-size: large;"><b>Transparent Geometry Pass</b></span><br />
In this pass, we render transparent meshes and particles. We blend the post-tonemapped color with the opaque geometry for performance reasons. Also, because we store the bloom intensity in the alpha channel, we want the transparent geometry to affect the bloom result as well. We solve this with 2 different methods depending on which platform the game runs on.<br />
<br />
On iOS, we render the mesh directly to the render target of the opaque geometry pass with a shader similar to the opaque pass, outputting the tonemapped scene color in RGB and the bloom intensity in A. To blend those 4 values over the opaque values, we use the <a href="https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_shader_framebuffer_fetch.txt">EXT_shader_framebuffer_fetch</a> OpenGL extension. So the blending happens at the end of the transparent geometry shader, and we use the simple blending formula below with the opacity of the mesh (because we want to keep it consistent with the other platforms):<br />
<blockquote class="tr_bq">
<span style="white-space: pre-wrap;">RGB= mesh color * mesh alpha </span><span style="white-space: pre-wrap;">+ dest color * (1 - mesh alpha)</span><span style="white-space: pre-wrap;"><br />A = mesh bloom intensity </span><span style="white-space: pre-wrap;">* mesh alpha </span><span style="white-space: pre-wrap;">+ dest </span><span style="white-space: pre-wrap;">bloom intensity </span><span style="white-space: pre-wrap;"> * (1 - mesh alpha)</span></blockquote>
<span style="white-space: pre-wrap;">On Windows and Mac, the </span><span style="white-space: pre-wrap;">EXT_shader_framebuffer_fetch does not exist. We render all the transparent meshes into a separate RGBA8 render target. We compute the scene color and bloom intensity similar to opaque pass, but before writing to the render target, we decompose the RGB scene color into luma and chroma and store the chroma value in checkerboard pattern similar to <a href="http://www.crytek.com/download/fmx2013_c3_art_tech_donzallaz_sousa.pdf">this paper(slide 104)</a>. So we can store luma+chroma in RG channel, bloom intensity in B channel and opacity of mesh in the A channel of the render target. </span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicsy1EnG5kC1wChv5ZBtX3zCaA_E5VoV50fPtPH4yzNaxKxj1jI3U_xx-BQCASCTFYnFWogGR55zuNvAYj3UbEilOdhYPk5gI7VEWlDTiEl4tA7yUBiNJyOYoGJLNPEJv0LD6cGCdQG-4x/s1600/pass_2_transparent.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicsy1EnG5kC1wChv5ZBtX3zCaA_E5VoV50fPtPH4yzNaxKxj1jI3U_xx-BQCASCTFYnFWogGR55zuNvAYj3UbEilOdhYPk5gI7VEWlDTiEl4tA7yUBiNJyOYoGJLNPEJv0LD6cGCdQG-4x/s400/pass_2_transparent.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Transparent render target on Windows platform</td></tr>
</tbody></table>
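<br />
Below is a minimal sketch of that checkerboard packing, using YCoCg as an example luma/chroma decomposition (the exact color space used in the game may differ):<br />
<pre>
// Encode one pixel of the transparent render target: returns luma
// (stored every pixel) plus one chroma component, alternating Co/Cg
// in a checkerboard pattern.
struct LumaChroma { float luma; float chroma; };

LumaChroma encodeCheckerboard(float r, float g, float b, int px, int py)
{
    float y  =  0.25f * r + 0.5f * g + 0.25f * b;  // luma
    float co =  0.5f  * r - 0.5f * b;              // orange chroma
    float cg = -0.25f * r + 0.5f * g - 0.25f * b;  // green chroma
    bool even = ((px + py) &amp; 1) == 0;              // checkerboard parity
    return { y, even ? co : cg };
}
// At resolve time, the missing chroma component of a pixel is
// reconstructed from its horizontal neighbours of the other parity.
</pre>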
<span style="white-space: pre-wrap;"><br /></span>
<span style="white-space: pre-wrap;">Finally, we can blend this transparent texture over the opaque geometry pass render target.</span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXgnflHMEFSr5AM6W3Qji_c2brUKvXxzTIhOW_f3RhNKOi2YbDUgn7H1pAxckhneGXolqYXK9jQwOk40mGq6mTKhyngc-OF5t6RnhDIB3uS2ajE1lwuGD1ZEqEUPpv7LtSiEA-ZBZG89cj/s1600/pass_2_transparent_compose.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXgnflHMEFSr5AM6W3Qji_c2brUKvXxzTIhOW_f3RhNKOi2YbDUgn7H1pAxckhneGXolqYXK9jQwOk40mGq6mTKhyngc-OF5t6RnhDIB3uS2ajE1lwuGD1ZEqEUPpv7LtSiEA-ZBZG89cj/s400/pass_2_transparent_compose.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Composed opaque and transform geometry</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><b>Post Process Pass</b></span><br />
After those geometry passes, we can blend in the bloom filter. We run several blur passes on those bright pixels and additively blend the result over the previous render pass output to enhance the bright effect.<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUW_j_AGZOKHL1ZOfitOgC4HkM-Cio6zgt-MIE4RcXIaTH5bYKhqhHrQDqiOF205nuwHclNY2U7tNvTkowUmzaQwW6nU52IFhRBm06cCV-P0aP1PE3WiM5Kb2hDDDE8HiCMiA5eFLsmFDd/s1600/pass_3_bloom.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUW_j_AGZOKHL1ZOfitOgC4HkM-Cio6zgt-MIE4RcXIaTH5bYKhqhHrQDqiOF205nuwHclNY2U7tNvTkowUmzaQwW6nU52IFhRBm06cCV-P0aP1PE3WiM5Kb2hDDDE8HiCMiA5eFLsmFDd/s320/pass_3_bloom.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blurred bright pixels</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBXnGm4OOzWOCFbo5-ZoeHuRZOM8CW37W16rFxphhzwY1thft7Q8BwfMZ1-uQV223kGO6ud1rjTcoECuFD3s50sZ7BZO4ce85XwPm48E_Xmnoim2iPbKJ9osrkAw8Xx3fn8syJkm9-SwNR/s1600/pass_3_bloom_compose.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBXnGm4OOzWOCFbo5-ZoeHuRZOM8CW37W16rFxphhzwY1thft7Q8BwfMZ1-uQV223kGO6ud1rjTcoECuFD3s50sZ7BZO4ce85XwPm48E_Xmnoim2iPbKJ9osrkAw8Xx3fn8syJkm9-SwNR/s320/pass_3_bloom_compose.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Additive blended bloom texture with scene color</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
Then we compute a simplified (but not very accurate, due to the lack of a velocity buffer) temporal anti-aliasing using the color and depth buffers of the current frame and the previous 2 frames. One thing we didn't mention is that, while rendering the opaque and transparent meshes, we jitter the camera projection by half a pixel, alternating between odd and even frames, similar to the figure below, so that we have sub-pixel information for anti-aliasing (a small sketch of the jitter follows the figures).<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm6KanAVxdPCv1HySaBnv0YxEEAboK7Kuu1OZxOcqfyuk7JFQ6RCS7uin2LXl21B398ZFDnNMhQE56g2o_QyjOznjCm18i-SLbqVQG_myIhCYXceWrAMByXls-5KyH_TFwqwbOdYgesm66/s1600/AA_grid.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="408" data-original-width="569" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm6KanAVxdPCv1HySaBnv0YxEEAboK7Kuu1OZxOcqfyuk7JFQ6RCS7uin2LXl21B398ZFDnNMhQE56g2o_QyjOznjCm18i-SLbqVQG_myIhCYXceWrAMByXls-5KyH_TFwqwbOdYgesm66/s320/AA_grid.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Temporal AA jitter pattern</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigL4hfMIFi4UWJ7oWryMf0A2yUseWdN-3eIwjVcHvD3UXK1b2trTdQqJwv5yLaIkc7azzL8HlDNBhQdAYGo2kkfnCnWI2ILY_K6QK5HXTkUOT_IblPnW78puoAaM_2Zy9GY-Ke3znCScto/s1600/pass_3_AA.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigL4hfMIFi4UWJ7oWryMf0A2yUseWdN-3eIwjVcHvD3UXK1b2trTdQqJwv5yLaIkc7azzL8HlDNBhQdAYGo2kkfnCnWI2ILY_K6QK5HXTkUOT_IblPnW78puoAaM_2Zy9GY-Ke3znCScto/s400/pass_3_AA.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Temporal anti-aliased image</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
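<br />
A minimal sketch of that projection jitter, written independently of matrix conventions by offsetting the clip-space position before the perspective divide (the exact offsets are illustrative):<br />
<pre>
// Sub-pixel camera jitter for temporal AA: alternate between two
// offsets a half pixel apart on odd and even frames, expressed in
// NDC units (one pixel spans 2/width in NDC).
struct Float2 { float x, y; };

Float2 taaJitterNdc(unsigned frameIndex, float width, float height)
{
    float p = (frameIndex &amp; 1u) ? 0.25f : -0.25f;  // +/- quarter pixel
    return { p * 2.0f / width, p * 2.0f / height };
}

// In the vertex shader (pseudocode, after the projection transform):
//   clipPos.x += jitter.x * clipPos.w;
//   clipPos.y += jitter.y * clipPos.w;
</pre>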
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
In this post, we broke down the render passes in "Seal Guardian", which consist mainly of 4 parts: shadow map, opaque geometry, transparent geometry and post process passes. By using fewer render passes, we can achieve a constant 60FPS in most cases (if the target framerate is not met, we may skip some render passes such as temporal AA and shadows).<br />
<br />
Lastly, <a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a> has already been released on <a href="http://store.steampowered.com/app/741620/Seal_Guardian/">Steam</a> / <a href="https://itunes.apple.com/us/app/seal-guardian/id1307643357?ls=1&mt=12">Mac App Store</a> / <a href="https://itunes.apple.com/us/app/seal-guardian/id1307585597?ls=1&mt=8">iOS App Store</a>. If you want to support us to develop games with custom tech, then buying a copy of the game on any platform will help. Thank you very much.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] The Art and Technology behind Crysis 3 <a href="http://www.crytek.com/download/fmx2013_c3_art_tech_donzallaz_sousa.pdf">http://www.crytek.com/download/fmx2013_c3_art_tech_donzallaz_sousa.pdf</a></span><br />
<br /><span style="font-size: x-large;"><b>Shadow in "Seal Guardian"</b></span><br />
<span style="font-size: large;"><b>Introduction</b></span><br />
<a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a> uses a mix of static and dynamic shadow systems to support long range shadow to cover the whole level. "Seal Guardian" only use a single directional for the whole level, so part of the shadow information can be pre-computed. It mainly consists of 3 parts: baked static shadow on static meshes stored along with the <a href="http://simonstechblog.blogspot.hk/2017/11/light-map-in-seal-guardian.html">light map</a>, baked static shadow for dynamic objects stored along with the irradiance volume and dynamic shadow with optional ESM soft shadow.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAWsV8pm_atapWf4Gx-RX1uh72as0W6VAlKFPjk-O99d_XVnb-rLVvA7t7KzGFUwYjOQSFbAY8fWEFLvj3gNOTweYg5ii8XXY6TSdFicTdcWrk3ycsE4HvcudLZI-VFP-jFz-fcas6-vtq/s1600/main.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="956" data-original-width="1427" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAWsV8pm_atapWf4Gx-RX1uh72as0W6VAlKFPjk-O99d_XVnb-rLVvA7t7KzGFUwYjOQSFbAY8fWEFLvj3gNOTweYg5ii8XXY6TSdFicTdcWrk3ycsE4HvcudLZI-VFP-jFz-fcas6-vtq/s400/main.png" width="400" /></a></div>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>Static shadow for static objects</b></span><br />
During the baking process of the light map, we also compute static shadow information. We first render a shadow map for the whole level into a big render target (e.g. 8192x8192); then, for each texel of the light map, we can compare its world position against the shadow map to check whether that texel is in shadow. But since we are using a 1024x1024 light map for the whole scene, storing the shadow term directly would not have enough resolution. So we use a <a href="http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">distance field representation</a>[1] to reduce the storage size, similar to the <a href="https://docs.unrealengine.com/udk/Three/DistanceFieldShadows.html">UDK</a>[2]. To bake the distance field representation of the shadow term, instead of comparing a single depth value at the texel world position as before, we compare several values within a 0.5m x 0.5m grid oriented along the normal at that position, similar to the figure below:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkYoA4qTuaVHhsleWL_BS0izn_Nr8KhuRh1UaNrB7oH5BayycV1ww8nC_ijy0wPJKEAvDM6foCdYiv73q463pVeUqqjLqJ8XBg8Pu54xd_BgU6nKc5JgJKfxBLRvKXZXEnGj2Zdpt7rJwx/s1600/grid.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="895" data-original-width="1338" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkYoA4qTuaVHhsleWL_BS0izn_Nr8KhuRh1UaNrB7oH5BayycV1ww8nC_ijy0wPJKEAvDM6foCdYiv73q463pVeUqqjLqJ8XBg8Pu54xd_BgU6nKc5JgJKfxBLRvKXZXEnGj2Zdpt7rJwx/s320/grid.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blue dots indicate the positions for sampling shadow map <br />
to compute distance field value for the texel at red dot position.<br />
(The gird is perpendicular to the red vertex normal of the texel.)</td></tr>
</tbody></table>
<br />
By doing this, we can gather the shadow information around the baking texel to compute the distance field. We chose this method instead of computing the distance field from a large baked shadow texture because we want the shadow distance field to be consistently computed in world space no matter what the mesh UV is, and this also avoids UV seams. This method may cause potential problems for concave meshes, but so far, for all levels in "Seal Guardian", it has not been a big problem. (The run-time lookup is sketched after the figure below.)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggVmLYA1fwaEB-kZB4hv4U_MppM6m6tiLxtjD0A5RB6MSy0WMfl9ShmgknRjYApxgW5IJA8ms7E2ae63LCpsoUpD-Jvk8Q1xnT3FQuX7L4rXtWM563MYmSfmkjZqi6NlTxlLc03yCnOaet/s1600/static_shadow_on_static.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1427" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggVmLYA1fwaEB-kZB4hv4U_MppM6m6tiLxtjD0A5RB6MSy0WMfl9ShmgknRjYApxgW5IJA8ms7E2ae63LCpsoUpD-Jvk8Q1xnT3FQuX7L4rXtWM563MYmSfmkjZqi6NlTxlLc03yCnOaet/s320/static_shadow_on_static.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Static shadow only</td></tr>
</tbody></table>
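<br />
For reference, sampling such a distance field at run time is very cheap: remap the bilinearly filtered distance value around the 0.5 iso-value, exactly as in the alpha-tested magnification paper[1]. A minimal sketch (the softness constant is an assumption):<br />
<pre>
#include &lt;algorithm&gt;

static float smoothstepf(float e0, float e1, float x)
{
    float t = std::clamp((x - e0) / (e1 - e0), 0.0f, 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

// 'sdf' is the bilinearly sampled distance field value in [0,1],
// where 0.5 is the shadow boundary; 'softness' widens the penumbra.
float distanceFieldShadow(float sdf, float softness /* e.g. 0.05f */)
{
    return smoothstepf(0.5f - softness, 0.5f + softness, sdf);
}
</pre>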
<br />
<span style="font-size: large;"><b>Static shadow for dynamic objects</b></span><br />
For dynamic objects to receive baked shadow, we bake shadow information and store it along with the irradiance volume. For each irradiance probe location, we compare it against the whole-scene shadow map and get a binary shadow value. At runtime, we interpolate this binary shadow value using the position of the dynamic object and the probe locations to get a smooth transition of the shadow value, just like interpolating the SH coefficients of the irradiance volume (see the sketch after the figures below).<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwPP9eyMQZpiRIxyo8G23eqBdzlMWvRii9M9BQA93XRwjNOzh2FijR8MqbXEcaOXjagkutorqRRauyLVrRlmQWB5OTsmaQKaVXe9Z6Hsqvq5R4SWzlATKkdIFRSFNkJWUFrqHyt3gWPcTA/s1600/static_shadow_on_dyna.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1427" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwPP9eyMQZpiRIxyo8G23eqBdzlMWvRii9M9BQA93XRwjNOzh2FijR8MqbXEcaOXjagkutorqRRauyLVrRlmQWB5OTsmaQKaVXe9Z6Hsqvq5R4SWzlATKkdIFRSFNkJWUFrqHyt3gWPcTA/s320/static_shadow_on_dyna.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Circled objects does not have light map UV, so they are treated the same as dynamic objects and shadowed with the shadow value stored along with irradiance volume</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCIjQYdFX9cqSZiM-mIG7WK7lyJt1c9tiCkHEgme96an9FQr9oFm4cIYykutmqM4odzJqkXAgCjW7jdJzkm3LBFxBYEEt7Y1gLptiqBBFffgFNPKTiup3cX1XmVi3kZdEDl-dS7zDOQs3J/s1600/volume_shadow_term.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1427" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCIjQYdFX9cqSZiM-mIG7WK7lyJt1c9tiCkHEgme96an9FQr9oFm4cIYykutmqM4odzJqkXAgCjW7jdJzkm3LBFxBYEEt7Y1gLptiqBBFffgFNPKTiup3cX1XmVi3kZdEDl-dS7zDOQs3J/s320/volume_shadow_term.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Each small sphere is a sampling location for storing the SH coefficients and shadow value of the irradiance for dynamic objects.</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
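A minimal sketch of that interpolation, assuming the probes sit on a regular grid so that the eight binary values surrounding the object can be blended trilinearly with the same weights as the SH coefficients:<br />
<pre>
// Trilinearly blend the 8 binary shadow values around a dynamic object.
// fx, fy, fz are the fractional coordinates of the object inside its
// probe cell; s is indexed as s[z][y][x], 0 = shadowed, 1 = lit.
float probeShadow(const float s[2][2][2], float fx, float fy, float fz)
{
    float x00 = s[0][0][0] + (s[0][0][1] - s[0][0][0]) * fx;
    float x10 = s[0][1][0] + (s[0][1][1] - s[0][1][0]) * fx;
    float x01 = s[1][0][0] + (s[1][0][1] - s[1][0][0]) * fx;
    float x11 = s[1][1][0] + (s[1][1][1] - s[1][1][0]) * fx;
    float y0  = x00 + (x10 - x00) * fy;
    float y1  = x01 + (x11 - x01) * fy;
    return y0 + (y1 - y0) * fz;    // smooth shadow value in [0,1]
}
</pre>
<br />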
<span style="font-size: large;"><b>Dynamic Shadow</b></span><br />
We use a standard shadow mapping algorithm with <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.177&rep=rep1&type=pdf">exponential shadow maps (ESM)</a>[3] to support dynamic shadow in <a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a>. However, because we need to support a variety of hardware (from iOS and Mac to PC) and minimise code complexity, we chose not to use cascaded shadow maps. Instead we use a single shadow map to support dynamic shadow over a very short distance (e.g. 30m-60m) and rely on baked shadow to cover the remaining part of the scene.
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhZPe-451uaV34KpqkC7vQ77GqeR-BDXKz-gSuVRD4F0JCODfgu3tLKoLyhptHZSq4GGa5gbmdKcOHjXoSi7nn5JicDe00KNNX2u5e83EKbAoSbwKpjd9IFWJ9vb5cdlUaN6PK51Bn8eyH/s1600/mixed_shadow.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1427" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhZPe-451uaV34KpqkC7vQ77GqeR-BDXKz-gSuVRD4F0JCODfgu3tLKoLyhptHZSq4GGa5gbmdKcOHjXoSi7nn5JicDe00KNNX2u5e83EKbAoSbwKpjd9IFWJ9vb5cdlUaN6PK51Bn8eyH/s320/mixed_shadow.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dynamic shadow mixed with static shadow</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6l-clqD3ZTEpxN_BM_o61ThywCtzhYGVaSJ9UH5fYIPcayw6sp6vZt3NocgYWVP3P_-_WHIlJ9ak1rWMXJmovebAIGtSghbmL0GJVbQDaba7EmiBnycLV1ZNyuW2Ck5R0YYVwx2qxS4Q/s1600/dyna_shadow.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1427" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6l-clqD3ZTEpxN_BM_o61ThywCtzhYGVaSJ9UH5fYIPcayw6sp6vZt3NocgYWVP3P_-_WHIlJ9ak1rWMXJmovebAIGtSghbmL0GJVbQDaba7EmiBnycLV1ZNyuW2Ck5R0YYVwx2qxS4Q/s320/dyna_shadow.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dynamic shadow only</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b style="font-size: x-large;">Shadow Quality Settings</b><br />
With the above systems, we can make a few shadow quality settings:<br />
<ol>
<li>mix of static shadow with dynamic ESM shadow</li>
<li>mix of static shadow with dynamic hard shadow</li>
<li>static shadow only</li>
</ol>
On the iOS platform, we choose the shadow quality depending on the device capability.
Besides, as we are using a forward renderer, objects outside the dynamic shadow distance can use the static-shadow-only shader to save a bit of performance.<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5fFMvgkTbFX8YzKTjnWaYN-fMS4AHiRW4_tn6gkvYQ5batZ6RMC8auSLaXi7UGJeheM_7_E4QGRcrUOVsvzqcsM2yelQ_MgduwHkd1uDqX7cqOATZkM6uP_pRP9sVhGjorgXOMFUS4umW/s1600/quality_soft.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5fFMvgkTbFX8YzKTjnWaYN-fMS4AHiRW4_tn6gkvYQ5batZ6RMC8auSLaXi7UGJeheM_7_E4QGRcrUOVsvzqcsM2yelQ_MgduwHkd1uDqX7cqOATZkM6uP_pRP9sVhGjorgXOMFUS4umW/s200/quality_soft.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Soft Shadow</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlznJuXj7tMVwZsfzVaTinK6JjtFfXt6aHdAdYVSYivXh7lij7sFXFB79tAyEjX_I4BaZrvYo69dNenyUc8uMk7KzswTmHmmuPB49PXhxo5TssnzGUICgm2drKOvatHKjfQ83SGYFcjAa/s1600/quality_hard.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlznJuXj7tMVwZsfzVaTinK6JjtFfXt6aHdAdYVSYivXh7lij7sFXFB79tAyEjX_I4BaZrvYo69dNenyUc8uMk7KzswTmHmmuPB49PXhxo5TssnzGUICgm2drKOvatHKjfQ83SGYFcjAa/s200/quality_hard.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Hard Shadow</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiFOCWPm6vXX0IspETTM8XzijWeLrjn1VGv2BtQ5pGCGExzsOLb32PXiLNuweSbdby78ZDr4F2Ig3xwMLyke4XsGPfGqxvVRYqYmCgKmugFPvuX8rjQnQuYQ_Q0iRP8WvMmcnMVKhBt7K_/s1600/quality_static.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="782" data-original-width="1336" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiFOCWPm6vXX0IspETTM8XzijWeLrjn1VGv2BtQ5pGCGExzsOLb32PXiLNuweSbdby78ZDr4F2Ig3xwMLyke4XsGPfGqxvVRYqYmCgKmugFPvuX8rjQnQuYQ_Q0iRP8WvMmcnMVKhBt7K_/s200/quality_static.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">No Shadow</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b style="font-size: x-large;">Conclusion</b><br />
We have briefly described the shadow system in <a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a>, which uses distance field shadow maps for static mesh shadows, interpolated static shadow values for dynamic objects and ESM dynamic shadow for a short distance. A few shadow quality settings can also be generated with very little coding effort.<br />
<br />
Lastly, if you are interested in "Seal Guardian", feel free to <a href="http://www.whitebudgie.games/seal_guardian.html">check it out</a>; its <a href="http://store.steampowered.com/app/741620/Seal_Guardian/">Steam store page</a> is live now. It will be released on 8<sup>th</sup> Dec, 2017 on iOS/Mac/PC. Thank you.<br />
<br />
<b>References</b><br />
<span style="font-size: x-small;">[1] <a href="http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf</a></span><br />
<span style="font-size: x-small;">[2] <a href="https://docs.unrealengine.com/udk/Three/DistanceFieldShadows.html">https://docs.unrealengine.com/udk/Three/DistanceFieldShadows.html</a></span><br />
<span style="font-size: x-small;">[3] <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.177&rep=rep1&type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.177&rep=rep1&type=pdf</a></span><br />
<br />
<br />
<br /><span style="font-size: x-large;"><b>Light Map in "Seal Guardian"</b></span><br />
<b><span style="font-size: large;">Introduction</span></b><br />
Light mapping is a common technique used in games for storing lighting data. <a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a> uses light maps in order to support a large variety of hardware from iOS and Mac to PC because of their low run-time cost. There are many methods to bake a light map, such as photon mapping and radiosity. Our baking method is similar to the <a href="https://www.siggraph.org/education/materials/HyperGraph/radiosity/overview_2.htm">radiosity hemicube</a>[1], but we render a full cube map for each light map texel to store the incoming lighting data instead.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOW_AnRguD7H2gCMTzgXDya9RBWq6ZpV-fFSVpsyAXYk4Vl7091ZLPvueNWZQUefqVLQsv-cYaO2BR3_Zf43MdUp1MlhZQJOafMzvAvi6vdDPcHMwLcoA4UM_jeLpKXpr92U3mzJofT11G/s1600/scene_with_lightmap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="868" data-original-width="1600" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOW_AnRguD7H2gCMTzgXDya9RBWq6ZpV-fFSVpsyAXYk4Vl7091ZLPvueNWZQUefqVLQsv-cYaO2BR3_Zf43MdUp1MlhZQJOafMzvAvi6vdDPcHMwLcoA4UM_jeLpKXpr92U3mzJofT11G/s320/scene_with_lightmap.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Scene with light map</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha9ZN3ic46myPGIfo2SnSXyGAW0uL_9SlFwc2WWYPnqoIbKqglkwzE4D2c-aQOIIGsG4jqQjD3Zvw6KrE4sQ1rnTBObtjJnbLWQ9DDvMtvOX6c9oynvmR1ZgqJah_FEAqiwi1ttihl65xL/s1600/scene_no_lightmap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="868" data-original-width="1600" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha9ZN3ic46myPGIfo2SnSXyGAW0uL_9SlFwc2WWYPnqoIbKqglkwzE4D2c-aQOIIGsG4jqQjD3Zvw6KrE4sQ1rnTBObtjJnbLWQ9DDvMtvOX6c9oynvmR1ZgqJah_FEAqiwi1ttihl65xL/s320/scene_no_lightmap.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Scene without light map</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
<br /></div>
<b><span style="font-size: large;">Light Map Atlas</span></b><br />
In each level, the light map is built for all static meshes with a second, unique UV set. We gather all those static meshes and pack them into a large light map atlas using this <a href="http://blackpawn.com/texts/lightmaps/">method</a>[2]; other methods could be chosen, we just picked a simple one.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0OCTCF3FMONMG9CVoH6vQ04z0fvMQCY2dSrbaAK8LKpkHrzy7hEyWJax98-hP4qWqtfQhJ-YuMXVfXSE4w3zGdYd-hsciQV08qpHtC_G8yJMzlPJG0KqdJDqX9SzJgZFuwz2VCTzTu6iu/s1600/atlas.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="514" data-original-width="514" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0OCTCF3FMONMG9CVoH6vQ04z0fvMQCY2dSrbaAK8LKpkHrzy7hEyWJax98-hP4qWqtfQhJ-YuMXVfXSE4w3zGdYd-hsciQV08qpHtC_G8yJMzlPJG0KqdJDqX9SzJgZFuwz2VCTzTu6iu/s320/atlas.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Packing a single large light map atlas for all static mesh in the scene</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Compute Light Map Texel Position</span></b><br />
Then we render all the meshes into an RGBA32Float world position render target using the light map atlas layout created before (by a vertex shader which transforms the mesh vertex from its 3D world position to its unique 2D light map UV). Then we read back the render target and store all the written texels, which correspond to the world position of each light map texel. Those positions will be used for rendering cube maps for radiosity (a small sketch of the vertex transform follows the figure below).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheYtL3v_4ZrYE2tfOzZ7JlpyquGQfO95Pqxo7t1RfH2koV9gKVrDTXC-p1zTOi3dKugYowcthmY5XQQu4SUnmdXQ3uZ8r9zShVLbj_evgigsFeFYxB5OVNLsaE74XGDTmPwzCzBU5rWaDJ/s1600/lightmap_density.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="868" data-original-width="1600" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheYtL3v_4ZrYE2tfOzZ7JlpyquGQfO95Pqxo7t1RfH2koV9gKVrDTXC-p1zTOi3dKugYowcthmY5XQQu4SUnmdXQ3uZ8r9zShVLbj_evgigsFeFYxB5OVNLsaE74XGDTmPwzCzBU5rWaDJ/s400/lightmap_density.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Each square represent a single light map texel,<br />
we query back those texel world space position to render cube map for radiosity</td></tr>
</tbody></table>
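<br />
A minimal sketch of that vertex transform: the unique 2D light map UV becomes the output position, while the 3D world position is passed through as the color payload (the Y flip depends on the API convention):<br />
<pre>
struct BakeVertexOut {
    float clipX, clipY;           // position inside the atlas render target
    float worldX, worldY, worldZ; // payload written to the RGBA32Float target
};

// u, v in [0,1] is the second, unique light map UV of the static mesh.
BakeVertexOut bakeVertex(float u, float v, float wx, float wy, float wz)
{
    BakeVertexOut o;
    o.clipX = u * 2.0f - 1.0f;    // map [0,1] UV to [-1,1] NDC
    o.clipY = 1.0f - v * 2.0f;    // Y flip, depending on API convention
    o.worldX = wx; o.worldY = wy; o.worldZ = wz;
    return o;
}
</pre>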
<br />
<b><span style="font-size: large;">Radiosity Baking</span></b><br />
As mentioned before, we use a method similar to the hemicube, but render a full cube map instead: we render a cube map at each light map texel with all the post-processing effects/tone mapping turned off, just storing the lighting data. Because our light map is intended to store the incoming indirect static lighting for each texel, we convert the incoming lighting cube map rendered at each texel to 2nd order spherical harmonics coefficients (i.e. 4 coefficients for each of the RGB channels); the conversion method can be found in <a href="http://www.ppsloan.org/publications/StupidSH36.pdf">"Stupid Spherical Harmonics (SH) Tricks"</a>[3]. So we need 1 RGBA32Float (or RGBA16Float) cube map and 3 temporary RGBA32Float (or RGBA16Float) textures for each radiosity iteration (a small sketch of the projection follows the figures below).<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB65Png3ZN5VME4XLEbM0e0MzaxoqG7_dDlnVQOb11Qg7V8P82jEEkAQwP9j1q8B7m7Gwu9GXpDjTcawgIfvHAuGxdR_bwA9zmkS64YEnxSJ_1KbTBQy9l2oPOzfps4yecrJ1TiL1CwIDx/s1600/bake_pass_0.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB65Png3ZN5VME4XLEbM0e0MzaxoqG7_dDlnVQOb11Qg7V8P82jEEkAQwP9j1q8B7m7Gwu9GXpDjTcawgIfvHAuGxdR_bwA9zmkS64YEnxSJ_1KbTBQy9l2oPOzfps4yecrJ1TiL1CwIDx/s320/bake_pass_0.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">No light map, direct lighting and emissive materials only </td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEir4BJRmFmC5qEPs_yOK0Kok2MUgKCv-iNYuTEk4RVsXc2e9h0DbhdC1NgnU9k2h72xtehX8j_W37kOIXX0ApD3u7QNcylO8LjVfiM7FV0-rYKeh0rldlXrCMbZuZIfLD6rU9OL_yJelWu0/s1600/bake_pass_0_lighting.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEir4BJRmFmC5qEPs_yOK0Kok2MUgKCv-iNYuTEk4RVsXc2e9h0DbhdC1NgnU9k2h72xtehX8j_W37kOIXX0ApD3u7QNcylO8LjVfiM7FV0-rYKeh0rldlXrCMbZuZIfLD6rU9OL_yJelWu0/s320/bake_pass_0_lighting.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lighting without albedo texture</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
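<br />
A minimal sketch of that cube-map-to-SH projection for one color channel (2nd order SH, 4 coefficients); the per-texel solid angle weight compensates for the uneven texel distribution over a cube face. The face direction mapping is abbreviated here:<br />
<pre>
#include &lt;cmath&gt;

// Unnormalized direction of a texel at (u, v) in [-1,1] on a cube face.
// Only the +X face is shown; the other 5 faces are analogous.
static void faceDir(int face, float u, float v, float d[3])
{
    if (face == 0) { d[0] = 1.0f; d[1] = -v; d[2] = -u; }
    // ... faces 1..5 omitted for brevity ...
}

// texels: face-major, 6 * size * size values of one channel.
void projectToSH(const float* texels, int size, float coeff[4])
{
    coeff[0] = coeff[1] = coeff[2] = coeff[3] = 0.0f;
    for (int face = 0; face != 6; ++face)
    for (int y = 0; y != size; ++y)
    for (int x = 0; x != size; ++x) {
        float u = (2.0f * (x + 0.5f)) / size - 1.0f;
        float v = (2.0f * (y + 0.5f)) / size - 1.0f;
        // solid angle of this texel: du*dv / (1 + u^2 + v^2)^1.5
        float duv = 2.0f / size;
        float w = duv * duv / std::pow(1.0f + u * u + v * v, 1.5f);
        float d[3]; faceDir(face, u, v, d);
        float inv = 1.0f / std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        float L = texels[(face * size + y) * size + x] * w;
        coeff[0] += L * 0.282095f;                 // Y00
        coeff[1] += L * 0.488603f * d[1] * inv;    // Y1-1 (y)
        coeff[2] += L * 0.488603f * d[2] * inv;    // Y10  (z)
        coeff[3] += L * 0.488603f * d[0] * inv;    // Y11  (x)
    }
}
</pre>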
<br />
<div>
<b><span style="font-size: large;">Radiosity pass 1</span></b><br />
In the first pass, we render all the meshes without any analytical light source (e.g. the directional light) into the cube map. Only the emissive materials such as the sky and the static lights placed in the scene get rendered, injecting the initial lighting into the radiosity iterations. We support sphere and box shaped static lights, which get rendered into the cube map just like an emissive mesh. Once the cube map render is completed, we convert the cube map to SH coefficients and store the values. After all the texels are rendered, we have an incoming lighting light map from emissive meshes and static light sources ready for the next pass.<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzK5_5GDoRzlFHSA1872vy5JY0oBso3JRzVkltryeRxqLMENkmsSIu6FZf_BUlBtVhRYed59lcZaCTdU-f9EmglzgiW91UDvkV4YXT3eV8wChiKTdaM8Aq57kvh9uJIkiwlf4rnCvMRpxo/s1600/bake_pass_1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzK5_5GDoRzlFHSA1872vy5JY0oBso3JRzVkltryeRxqLMENkmsSIu6FZf_BUlBtVhRYed59lcZaCTdU-f9EmglzgiW91UDvkV4YXT3eV8wChiKTdaM8Aq57kvh9uJIkiwlf4rnCvMRpxo/s320/bake_pass_1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Light map baked with the emissive sky material</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfauMAkHYrulpTI7xh5uPKa-XFeFVKNloi7T2QiYDiP7Y5Jtjiia8utrZVVevHD1OoQF8bRnC5PjpdjmGDM1tzgZ7OdagplyVDoxWJvNRfLONOvj7ai_9p2JB7rTGQYdr6h5K6NkG-00b/s1600/bake_pass_1_lighting.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfauMAkHYrulpTI7xh5uPKa-XFeFVKNloi7T2QiYDiP7Y5Jtjiia8utrZVVevHD1OoQF8bRnC5PjpdjmGDM1tzgZ7OdagplyVDoxWJvNRfLONOvj7ai_9p2JB7rTGQYdr6h5K6NkG-00b/s320/bake_pass_1_lighting.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lighting with light map using the emissive sky</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Radiosity pass 2</span></b><br />
In the second pass, we render all the meshes with only the analytical light sources and the SH light map from the previous pass into a new cube map to calculate the first-bounce incoming lighting, then convert the cube map to SH coefficients. After all the texels are rendered and converted to an SH light map, we sum this SH light map with the previous pass's SH light map to accumulate the lighting of passes 1 and 2 into another 3 accumulated SH light maps for our final storage (these accumulated light maps are not used in the radiosity iteration, they are just the final radiosity output).<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnlItwOGz7KBkU2miR1CWedUaLaoX2jbAlwYgqwP5WR-RzhQYBRyDizi_jm-woETk9QvNP1MUXtynOjcKsXtvhURHIO40u9hBjGN-YxcmdoyHEIrhkVXjRQ8BsQGxEsBVAZTu0VaC4djyN/s1600/bake_pass_2_indirect.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnlItwOGz7KBkU2miR1CWedUaLaoX2jbAlwYgqwP5WR-RzhQYBRyDizi_jm-woETk9QvNP1MUXtynOjcKsXtvhURHIO40u9hBjGN-YxcmdoyHEIrhkVXjRQ8BsQGxEsBVAZTu0VaC4djyN/s320/bake_pass_2_indirect.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Light map baked with direct lighting and emissive material</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmzQvUoSItgQuOCq64ax6WQe5HV0RWCbNuH_ImnEUiBb_bzT6mJybHvIBsp0i3mZTP2ZVvjEdzsmiROUvgkJiaLedocbjocldguIV5juAnNe4LYtqG5rE-Tp2XkZOKzhHwNf9hgzY2Tuwj/s1600/bake_pass_2_indirect_lighting.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmzQvUoSItgQuOCq64ax6WQe5HV0RWCbNuH_ImnEUiBb_bzT6mJybHvIBsp0i3mZTP2ZVvjEdzsmiROUvgkJiaLedocbjocldguIV5juAnNe4LYtqG5rE-Tp2XkZOKzhHwNf9hgzY2Tuwj/s320/bake_pass_2_indirect_lighting.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lighting with light map using direct lighting and emissive sky</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
</div>
<br />
<b><span style="font-size: large;">Radiosity pass >= 3</span></b><br />
For the subsequent passes, we can use the SH light map from the previous iteration to render the cube maps and repeat the convert-to-SH and accumulate-SH-lighting steps to get the incoming indirect lighting for each light map texel.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_5eYNzqxtIbAHFRdkyLk2IUJvvbnTFPXFFDFBhIwC4DAuGcTModn_D9cIiTR7tC46g_QEKrC7tyYUSxgSWEJ5TIBK2uWKaPzp4RBR9AXDBvO9D6_uRSOfRE47DV5fak_ZtBSYq-iB3DnC/s1600/bake_pass_3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_5eYNzqxtIbAHFRdkyLk2IUJvvbnTFPXFFDFBhIwC4DAuGcTModn_D9cIiTR7tC46g_QEKrC7tyYUSxgSWEJ5TIBK2uWKaPzp4RBR9AXDBvO9D6_uRSOfRE47DV5fak_ZtBSYq-iB3DnC/s320/bake_pass_3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Final baked result, showing both direct and indrect lighting</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9hJIDaNtg6zRQ37LEPeNbxj4tvDtt8MEQ1BrRZG3H1G0p5kouYaoOY43k9AblM6QkLPNEaqM0phGePu2xFtmi5f3NwR8CIq59vpwi4QdL9seunoFtZSnn1BcSdv8oR-Vg07uLiwKv0TrK/s1600/bake_pass_3_lighting.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1024" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9hJIDaNtg6zRQ37LEPeNbxj4tvDtt8MEQ1BrRZG3H1G0p5kouYaoOY43k9AblM6QkLPNEaqM0phGePu2xFtmi5f3NwR8CIq59vpwi4QdL9seunoFtZSnn1BcSdv8oR-Vg07uLiwKv0TrK/s320/bake_pass_3_lighting.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lighting using light map, without albedo texture </td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Storage Format</span></b><br />
To store the light map data for runtime and reduce memory usage (3 SH light maps in float format, i.e. 12 values per texel, is too much data to store...), we decompose the incoming lighting color data into luma and chroma. Only the luma data is stored in SH format. For the chroma, we integrate the SH RGB incoming lighting with an SH cosine transfer function along the static mesh normal direction, which gives the reflected Lambertian lighting, and we compute an average chroma value from it. This preserves the directional variation of the indirect lighting, keeps an average color of the incoming lighting, and reduces the light map storage to 6 values per texel. To further reduce memory usage, we clamp the incoming SH luma values to a predefined range so that they can be stored in an 8-bit texture. However, block compression like DXT results in artifacts, so we simply store the light map data in 2 RGBA8 textures. <br />
<br />
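For illustration, decoding such a light map at run-time could look like the sketch below; the channel layout, the remap range and the luma/chroma reconstruction here are assumptions for the example, not the exact format shipped in "Seal Guardian":<br />
<blockquote class="tr_bq">
<span style="font-size: xx-small;">Texture2D lightMapSH; // RGBA8: the 4 SH luma coefficients, clamped to [shMin, shMax]<br />Texture2D lightMapChroma; // RGBA8: the average chroma stored in .rg<br />SamplerState samplerLinearClamp;<br /><br />float3 DecodeLightMap(float2 uv, float3 N, float shMin, float shMax)<br />{<br /> float4 shLuma = lightMapSH.Sample(samplerLinearClamp, uv);<br /> shLuma = shLuma * (shMax - shMin) + shMin; // undo the 8-bit range clamp<br /> // evaluate the luma SH along the surface normal<br /> float4 basis = float4(0.282095, 0.488603 * N.y, 0.488603 * N.z, 0.488603 * N.x);<br /> float luma = max(dot(shLuma, basis), 0.0);<br /> float2 chroma = lightMapChroma.Sample(samplerLinearClamp, uv).rg;<br /> // hypothetical chroma-offset reconstruction; the actual encoding is not described here<br /> return float3(luma + chroma.x, luma, luma + chroma.y);<br />}</span></blockquote>
<br />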
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgctE0C6k79ZTQEKeiZB7GX1eEC8aUy-KVKAoBHLbqY9iGXc1CMrImPuzkTJZbY8a4LtAWzxfhms4PA09ChxJKsY3hgwqnfAwqMDBcV9F8QPRZFncfK4wE-TTPQKk9eCCR1rkidUyL-rRXO/s1600/final_lightmap.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="514" data-original-width="1027" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgctE0C6k79ZTQEKeiZB7GX1eEC8aUy-KVKAoBHLbqY9iGXc1CMrImPuzkTJZbY8a4LtAWzxfhms4PA09ChxJKsY3hgwqnfAwqMDBcV9F8QPRZFncfK4wE-TTPQKk9eCCR1rkidUyL-rRXO/s640/final_lightmap.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Final light map used in run-time, storing SH luma and average chroma</td></tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
In this post, we have briefly outlined how the light maps are created in "Seal Guardian". The baker is based on a modified radiosity hemicube method, using SH as an intermediate representation for baking and reducing the storage size by splitting the lighting data into luma and chroma. We skipped some of the baking details, like padding the lighting data for each UV shell in each radiosity iteration to avoid light leaking from empty light map texels. Also, "Seal Guardian" is rendered using PBR, which means we have metallic materials that don't work well with radiosity. Instead of converting metallic surfaces to diffuse materials, we pre-filter all the environment probes in each radiosity pass to get the lighting for metallic materials. In the future, we would like to improve the light map baking, such as reducing the baking time, fixing the compression problem (we may try BC6H, but would need to find another compression method for iOS...), or using a smaller texture size for the chroma light map than for the luma SH light map texture...<br />
<br />
Lastly, if you are interested in "Seal Guardian", feel free to <a href="http://www.whitebudgie.games/seal_guardian.html">check it out</a>; its <a href="http://store.steampowered.com/app/741620/Seal_Guardian/">Steam store page</a> is live now. It will be released on 8<sup>th</sup> Dec, 2017 on iOS/Mac/PC. Thank you.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFtlx-0fLAOT5q0-zOOie2mDCXynVLK61FeYPOfkzMKlNfcpn5781sMyEW5Hk3t6eJifHnefOFeoieN7uiyJeO_wg87AlAyX4sVVPt_1laqgkaDNYIXCnwuluznMhxIZhZ6XS-o64d_Oj6/s1600/metallic_wall_direct.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="476" data-original-width="1024" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFtlx-0fLAOT5q0-zOOie2mDCXynVLK61FeYPOfkzMKlNfcpn5781sMyEW5Hk3t6eJifHnefOFeoieN7uiyJeO_wg87AlAyX4sVVPt_1laqgkaDNYIXCnwuluznMhxIZhZ6XS-o64d_Oj6/s320/metallic_wall_direct.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The yellow light bounce on the floor is done by the yellow metallic wall with pre-filtering the environment map in each radiosity pass</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPe_bMvHiaCKj7qg84jjy2LeiVFvee2sxAl5KSJKQf9MGASUgjpNgoqpHQ0gmhLJcbpycZqzOVHcllbtWNg-fLoDAX7_LQpl-c_2_mMW1cTsYrA3p0VRgi2Kvzg2vhZX4kjy_-DrnQISt6/s1600/metallic_wall_indirect.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="476" data-original-width="1024" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPe_bMvHiaCKj7qg84jjy2LeiVFvee2sxAl5KSJKQf9MGASUgjpNgoqpHQ0gmhLJcbpycZqzOVHcllbtWNg-fLoDAX7_LQpl-c_2_mMW1cTsYrA3p0VRgi2Kvzg2vhZX4kjy_-DrnQISt6/s320/metallic_wall_indirect.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Showing only the indirect lighting<br />
<br /></td></tr>
</tbody></table>
</td><td><br /></td></tr>
</tbody></table>
<br /><b>References</b><br />
<span style="font-size: x-small;">[1] <a href="https://www.siggraph.org/education/materials/HyperGraph/radiosity/overview_2.htm">https://www.siggraph.org/education/materials/HyperGraph/radiosity/overview_2.htm</a></span><br />
<span style="font-size: x-small;">[2] <a href="http://blackpawn.com/texts/lightmaps/">http://blackpawn.com/texts/lightmaps/</a></span><br />
<span style="font-size: x-small;">[3] <a href="http://www.ppsloan.org/publications/StupidSH36.pdf">http://www.ppsloan.org/publications/StupidSH36.pdf</a></span><br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com1tag:blogger.com,1999:blog-7659461179709896430.post-30857311232070502102017-11-23T08:18:00.000+08:002017-11-23T08:18:42.499+08:00"Seal Guardian" announced!!!<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcgmCmEA4Aad7ctamwjFI6p13kEbcAK_Rp44aOtkybXmkCjub-MtnTniq5kA977T1xPUUJxCTNl5aqI8wP0XqvNyUxY6XVZcQP9tb2UgDuOp25D0LdZK668Ol3yeR3KHSK-FkAhdT5_ORh/s1600/main_capsule_616_353.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="353" data-original-width="616" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcgmCmEA4Aad7ctamwjFI6p13kEbcAK_Rp44aOtkybXmkCjub-MtnTniq5kA977T1xPUUJxCTNl5aqI8wP0XqvNyUxY6XVZcQP9tb2UgDuOp25D0LdZK668Ol3yeR3KHSK-FkAhdT5_ORh/s640/main_capsule_616_353.png" width="640" /></a></div>
<br />
Finally, <a href="http://www.whitebudgie.games/seal_guardian.html">"Seal Guardian"</a> is announced!!!<br />
<br />
It has been a very long time since my last post; I was busy making the game "Seal Guardian".<br />
<br />
"Seal Guardian" is a hard core hack and slash action game, powered by the engine described in this blog. It took me more than 5 years to code the engine(with some help of open source libraries like Bullet physics, Lua and DLMalloc.)/gameplay and creating all the visual artwork from modelling, texturing, skinning, rigging and animation... The game will be available on 8<sup>th</sup> December, 2017 on iOS/Mac/PC via iOS App Store/Mac App Store/<a href="http://store.steampowered.com/app/741620/Seal_Guardian/">Steam</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/Ek2DMv7V2yM/0.jpg" frameborder="0" height="270" src="https://www.youtube.com/embed/Ek2DMv7V2yM?feature=player_embedded" width="480"></iframe></div>
<br />
Finishing a game takes lots of effort, patience and time, especially when making the whole game on your own. It contains lots of fun tasks like rendering and gameplay, but even more boring tasks like the game menu, localisation, and UI (e.g. handling all the input UI for mouse/keyboard, touch screen, and different gamepad types like PS4/XBox/MFi, in different languages, at different resolutions), making the website, preparing online store artwork like the app icon, trailer video and screenshots (where different stores have different resolution requirements... :S), and opening a bank account (which may take more than a month for a small indie game company). I hope I can share these in future blog posts.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvYetX1Qwz4FSC3uSbeTVwe17tJ4FJf9Cx79mDOKaqwMHeEx9rQscu2onkihyW7xnnowHKC_xLrzDqWjqMXMHOB973LUHA6cbRv0UgKQ_LLPS8C-XecdOyzw6Xj0LsXqQc309YRj7wQtuY/s1600/shot_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvYetX1Qwz4FSC3uSbeTVwe17tJ4FJf9Cx79mDOKaqwMHeEx9rQscu2onkihyW7xnnowHKC_xLrzDqWjqMXMHOB973LUHA6cbRv0UgKQ_LLPS8C-XecdOyzw6Xj0LsXqQc309YRj7wQtuY/s400/shot_2.png" width="400" /></a></div>
<br />
While waiting for the game's release on 8<sup>th</sup> December, 2017, and before sharing its postmortem, I will write some more blog posts about the engine tech used in "Seal Guardian", e.g. the light map baking process and its storage format, how the static shadows are baked, the visibility system, the cross platform rendering pipeline... So, stay tuned!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgilCsBMFHrPd2icJ9cNQfgB2njvUz-dulFx4caBNm9etyLrQHjDtquQ8gHOW_1hXkiKvKijMefKRzrnR7szJ-9vZ2jUQSVh31_wgmlK6x34ZsorTyP-urpLjBFPmKGZzGykpue1V0OTEd-/s1600/shot_3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgilCsBMFHrPd2icJ9cNQfgB2njvUz-dulFx4caBNm9etyLrQHjDtquQ8gHOW_1hXkiKvKijMefKRzrnR7szJ-9vZ2jUQSVh31_wgmlK6x34ZsorTyP-urpLjBFPmKGZzGykpue1V0OTEd-/s400/shot_3.png" width="400" /></a></div>
<br />
In the meantime, feel free to visit the "Seal Guardian" <a href="http://store.steampowered.com/app/741620/Seal_Guardian/">Steam store page</a> and share it if you like.<br />
Thank you very much!!! =]<br />
<br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com0tag:blogger.com,1999:blog-7659461179709896430.post-68441989085193248792015-02-07T08:50:00.000+08:002015-02-08T03:20:02.003+08:00Pre-Integrated Skin Shading<b><span class="Apple-style-span" style="font-size: large;">Introduction</span></b><br />
Recently, I was implementing skin shading in my engine. I chose the pre-integrated skin shading technique because it has a low performance cost and does not require an extra pass. The idea is to pre-bake the scattering effect over a ring into a texture, for different curvatures, to be looked up at run-time. More information can be found in <a href="http://www.amazon.com/GPU-Pro-2-Wolfgang-Engel/dp/1568817185">GPU Pro 2</a>, <a href="http://advances.realtimerendering.com/s2011/Penner%20-%20Pre-Integrated%20Skin%20Rendering%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx">the SIGGRAPH slides</a>, and also in <a href="http://blog.selfshadow.com/publications/s2013-shading-course/rad/s2013_pbs_rad_notes.pdf">the presentation on the game "The Order: 1886"</a>. Here is the result implemented in my engine (all screenshots are rendered with filmic tone mapping):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-kpxF6FACUp4Iw0UspXBmPWht0iNthcUQG7Zzzxt8lpEryrcWljc8-3unrzN1-Du8eQKtlE6B3P0BmQaNhmhKfjG5WgThyJbjjRDzRKVzgAbvmn9KOQLxbBhPE4SyBz_7MAbhYzXZnAs6/s1600/skin_result.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-kpxF6FACUp4Iw0UspXBmPWht0iNthcUQG7Zzzxt8lpEryrcWljc8-3unrzN1-Du8eQKtlE6B3P0BmQaNhmhKfjG5WgThyJbjjRDzRKVzgAbvmn9KOQLxbBhPE4SyBz_7MAbhYzXZnAs6/s1600/skin_result.png" height="243" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The head on the left is lit with Oren-Nayar shading<br />
The head on the right is lit with pre-integrated skin</td></tr>
</tbody></table>
<br />
<b><span class="Apple-style-span" style="font-size: large;">Curve Approximation for Direct Lighting</span></b><br />
In my engine, iOS is one of my target platforms, which only has <a href="https://developer.apple.com/library/ios/documentation/DeviceInformation/Reference/iOSDeviceCompatibility/OpenGLESPlatforms/OpenGLESPlatforms.html#//apple_ref/doc/uid/TP40013599-CH106-SW1">8 texture units</a> available with the OpenGL ES 2.0 API. This is not enough for the pre-integrated skin look up texture, because my engine has already used some slots for the light map, shadow map, IBL... So I need to find an approximation to the look up texture.<br />
<br />
Unfortunately, I don't have a Mathematica license at home, so I thought maybe I could fit the curve manually by inspecting its shape. I started by plotting the graph of the equation:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizQ0P2P6yPsUJAAKB5VWOCJwBzSoS0a6ekoRGPuWR8CABM3dzyjV-ma4jAwEXWDbv_9xt-nPmwqDCZ3a_hOMspiH8mAKU8GuUoXrTmUgK2tgoDMmOggzlkjE4p413HngOb9mYhKsxuct0e/s1600/forumla_direct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizQ0P2P6yPsUJAAKB5VWOCJwBzSoS0a6ekoRGPuWR8CABM3dzyjV-ma4jAwEXWDbv_9xt-nPmwqDCZ3a_hOMspiH8mAKU8GuUoXrTmUgK2tgoDMmOggzlkjE4p413HngOb9mYhKsxuct0e/s1600/forumla_direct.png" height="73" width="320" /></a></div>
Here is the shape of the red channel diffusion curve, plotted against <i>N.L</i> (normalized to [0, 1]) and <i>r</i> (from 2 to 16):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNK0EAQgprTv9MZ7xSSOzIczkC7_HT4PMY4UwVcZ5dBBQuYsIRnWAUBNKYx96AHMagol9SGGyXPwbh1e7UwGkOY5Rs-Deh9HowjDhGDsRC37bmFrRf0M8LCgDbQgtYxJB2HWEwi9VLHNw8/s1600/direct_R.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNK0EAQgprTv9MZ7xSSOzIczkC7_HT4PMY4UwVcZ5dBBQuYsIRnWAUBNKYx96AHMagol9SGGyXPwbh1e7UwGkOY5Rs-Deh9HowjDhGDsRC37bmFrRf0M8LCgDbQgtYxJB2HWEwi9VLHNw8/s1600/direct_R.png" height="189" width="200" /></a></div>
My idea for approximating the curve is to find some simple curves first and then interpolate between them like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT-y-fw7zyoI3kcos76mjo4ZZXiYeWoTt2Sj5TmHHiK7VjFLrK20ZOBnmgr4G_lm_v3-GErbznfM2CkwBoxFVM7XLe8C-DttMUv4RvQ3b2o7N_B1ZeyjU0YmVxTmuHgVzvDUIKjld3TBYO/s1600/approx_idea.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT-y-fw7zyoI3kcos76mjo4ZZXiYeWoTt2Sj5TmHHiK7VjFLrK20ZOBnmgr4G_lm_v3-GErbznfM2CkwBoxFVM7XLe8C-DttMUv4RvQ3b2o7N_B1ZeyjU0YmVxTmuHgVzvDUIKjld3TBYO/s1600/approx_idea.png" height="145" width="200" /></a></div>
For the light blue line in the above figure, a single straight line gives a close enough approximation:<br />
<div style="text-align: center;">
<i>curve1= saturate(1.95 * NdotL -0.96)</i></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_YSu0l5sC6UmPIezDC7O4sh7Y9ybJBqfHbt7o2ufbrWsJkDW8hAWWcjX5iVykncadKnj6SUwhtndLMJRzG9_SntNRzvUkxuI4oWSfSFfG3w2qaYKwAei62vrnwfHwc0_6Sx800SMUMlzK/s1600/direct_curve1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_YSu0l5sC6UmPIezDC7O4sh7Y9ybJBqfHbt7o2ufbrWsJkDW8hAWWcjX5iVykncadKnj6SUwhtndLMJRzG9_SntNRzvUkxuI4oWSfSFfG3w2qaYKwAei62vrnwfHwc0_6Sx800SMUMlzK/s1600/direct_curve1.png" height="189" width="200" /></a></div>
To approximate the dark blue line, I divide it into 2 parts: a linear part and a quadratic part:<br />
<div style="text-align: center;">
<i>curve0_linear= saturate(1.75* NdotL -0.76)</i></div>
<div style="text-align: center;">
<i>curve0_quadratic= 0.65*(NdotL^ 2) + 0.045</i></div>
<table align="center">
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpN5L1CiZyERtmOyCUAzqZSEQWu5Vqa0E0HpBc_po5BSlLfxrcWAV3umK1h_E8EdQ3AINNyqMu0mlR4iVzFkkt21ZN15lfQ9COdQWTyZLVgWWUjOngWvU53kz52n2x0hYjLwPRrGUwD6V/s1600/curve0_upper.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpN5L1CiZyERtmOyCUAzqZSEQWu5Vqa0E0HpBc_po5BSlLfxrcWAV3umK1h_E8EdQ3AINNyqMu0mlR4iVzFkkt21ZN15lfQ9COdQWTyZLVgWWUjOngWvU53kz52n2x0hYjLwPRrGUwD6V/s1600/curve0_upper.png" height="200" width="217" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVia3VMtY8w0FfstGEXbWXHTLhqCOARwmgSCkEw03PCtWAZrhYsnE_PsVz52bLQbZmTJeFjKf8FbtBTpgGfSTvt0QI6a8RSIo8ErjuRnMBOsqppQDQlxLyqKosmZ_jgq-HTSGiQZMbLm0s/s1600/curve0_lower.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVia3VMtY8w0FfstGEXbWXHTLhqCOARwmgSCkEw03PCtWAZrhYsnE_PsVz52bLQbZmTJeFjKf8FbtBTpgGfSTvt0QI6a8RSIo8ErjuRnMBOsqppQDQlxLyqKosmZ_jgq-HTSGiQZMbLm0s/s1600/curve0_lower.png" height="200" width="196" /></a> </td></tr>
</tbody></table>
Blending the linear and quadratic curves gives a curve that is similar to the original function:<br />
<div style="text-align: center;">
<i>curve0= lerp(curve0_quadratic, curve0_linear, NdotL^2)</i></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3BxeA3JnoPojPHlowrLt2IH6BnewPgB4EkVarJPGtlrovMjfukHWo5AgKg42bx8K0ycczJ3t3YbY80XKTULfVu3I7DC5Si3iDQjWTq2Vead0owhJWu_qeYP6qTepxoLLMWE5KKLPdz6fF/s1600/direct_curve0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3BxeA3JnoPojPHlowrLt2IH6BnewPgB4EkVarJPGtlrovMjfukHWo5AgKg42bx8K0ycczJ3t3YbY80XKTULfVu3I7DC5Si3iDQjWTq2Vead0owhJWu_qeYP6qTepxoLLMWE5KKLPdz6fF/s1600/direct_curve0.png" height="176" width="200" /></a></div>
Now we have 2 curves that are similar to our original function at both ends. By mixing them together, we can get something close to the original function like this:<br />
<div style="text-align: center;">
<i>curve= lerp(curve0, curve1, 1 - (1 - curvature)^4)</i></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLs8fyOHCb_1hmBWLk2X6EUmMPb9oOBVTcd4Zlb1NEYk9ytzLLK9x1KNH0wXsGb5qQL-_A8jTh3mFRpjGjygpWsztjdh49C6hD1_gxWXN7ivOUyfgOVxsTb9PaDnzvHejRzLghxxdcEnat/s1600/func_compare.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLs8fyOHCb_1hmBWLk2X6EUmMPb9oOBVTcd4Zlb1NEYk9ytzLLK9x1KNH0wXsGb5qQL-_A8jTh3mFRpjGjygpWsztjdh49C6hD1_gxWXN7ivOUyfgOVxsTb9PaDnzvHejRzLghxxdcEnat/s1600/func_compare.png" height="195" width="200" /></a></div>
By repeating the above steps for the blue and green channels, we can shade the pre-integrated skin without the look up texture. Here is the result:<br />
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWDt5JJbUW8AkpI9MtWIwJIRkZZv1US5sRM2xVSlLLdfyGKJ7jAgR4N6vTDpPtXX7d9F-Lcr2e4LEOslSYvYydXfyaNiiFoa0xSE8mptr7iMyNu93dneA46ChUEPDbdIsH2r3TKdmPwrrp/s1600/direct_exact_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWDt5JJbUW8AkpI9MtWIwJIRkZZv1US5sRM2xVSlLLdfyGKJ7jAgR4N6vTDpPtXX7d9F-Lcr2e4LEOslSYvYydXfyaNiiFoa0xSE8mptr7iMyNu93dneA46ChUEPDbdIsH2r3TKdmPwrrp/s1600/direct_exact_r2.png" height="106" width="103" /></a></div>
</td>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXQdQkkmDFQsGX3iKnxM8hKggZR6NDtJz-xiqpGaQK44fNOP80UdtFW_bqIrfvql1z-wZPVVhnFsjKhSmonIlo9O0cd-G0G3s4lw59Khdz626I8zn793YlIthV9tW1vbviZ4hPciqvE_E6/s1600/direct_exact_r4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXQdQkkmDFQsGX3iKnxM8hKggZR6NDtJz-xiqpGaQK44fNOP80UdtFW_bqIrfvql1z-wZPVVhnFsjKhSmonIlo9O0cd-G0G3s4lw59Khdz626I8zn793YlIthV9tW1vbviZ4hPciqvE_E6/s1600/direct_exact_r4.png" height="106" width="103" /></a></div>
</td>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjva5O0OkQyUn6_3PAaKuVelEWM5o3SwyVOm3NbqP9BpNRDLAVuKk_xH0OoyE20sTbG7t9NukF3qXr40y1W_Ni2NQp0kmgwIMavXLi6XGnkLQPZEWFEDZ0Lu_3019Sl0h0PHVfepGsV-yaL/s1600/direct_exact_r8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjva5O0OkQyUn6_3PAaKuVelEWM5o3SwyVOm3NbqP9BpNRDLAVuKk_xH0OoyE20sTbG7t9NukF3qXr40y1W_Ni2NQp0kmgwIMavXLi6XGnkLQPZEWFEDZ0Lu_3019Sl0h0PHVfepGsV-yaL/s1600/direct_exact_r8.png" height="106" width="103" /></a></div>
</td>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHkVPnUImbO6Q-FjwY5wUVcSt-nMNDA9elWrI1z5ylKyt-re2fNE1tbXMRZn4z_lQ-JTjcgVZLLfGU79nIATQEL4m67cCy5jYt0aijKyWKNtT9ypnNNlx7Z8nXXuaBJ3_7IwZoSVYx4N9n/s1600/direct_exact_r16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHkVPnUImbO6Q-FjwY5wUVcSt-nMNDA9elWrI1z5ylKyt-re2fNE1tbXMRZn4z_lQ-JTjcgVZLLfGU79nIATQEL4m67cCy5jYt0aijKyWKNtT9ypnNNlx7Z8nXXuaBJ3_7IwZoSVYx4N9n/s1600/direct_exact_r16.png" height="106" width="103" /></a></div>
</td>
</tr>
<tr>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx5ra4R4UMTYN9GzxqNneQgQwxAbA-ExoQr45CKvP5OEO471qjSpI0eHLIqI69ML97MjTPbjGcZBA5U2kMoFzw0HTUKwE94LyWXK3fQmaZ6Mum6OC9UHPmElllxcAZZgZWDaQGDpR1svwM/s1600/direct_approx_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx5ra4R4UMTYN9GzxqNneQgQwxAbA-ExoQr45CKvP5OEO471qjSpI0eHLIqI69ML97MjTPbjGcZBA5U2kMoFzw0HTUKwE94LyWXK3fQmaZ6Mum6OC9UHPmElllxcAZZgZWDaQGDpR1svwM/s1600/direct_approx_r2.png" height="106" width="103" /></a></div>
</td>
<td height="106" width="103"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq7_LE1IZ9mgpRZSGMfAjW05rxCDMonqMIF6aUDlxfpYAJSTBgqckXZ5GTmvfcSuEeShNTO8tewaNfvzWD6knfKyv8SDEYzsZuWF61_BpdwL45jQB-f3sCOTfasP1N_1t0p71h4eAnEKQf/s1600/direct_approx_r4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq7_LE1IZ9mgpRZSGMfAjW05rxCDMonqMIF6aUDlxfpYAJSTBgqckXZ5GTmvfcSuEeShNTO8tewaNfvzWD6knfKyv8SDEYzsZuWF61_BpdwL45jQB-f3sCOTfasP1N_1t0p71h4eAnEKQf/s1600/direct_approx_r4.png" height="106" width="103" /></a></div>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJZWOb5Ouexc3ZVzR2WQoU3Z54pwyvYLRexcK1Uq_EML23VHAMJ_i2EEr_wi8fyR1tPODf7kc8RYDRbrycjPnRFjvhOg-XhhLJnSu7GjdGlBvjWsOC1kJGfJmRlOZzBD6kIlJaMIjkJdyj/s1600/direct_approx_r8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJZWOb5Ouexc3ZVzR2WQoU3Z54pwyvYLRexcK1Uq_EML23VHAMJ_i2EEr_wi8fyR1tPODf7kc8RYDRbrycjPnRFjvhOg-XhhLJnSu7GjdGlBvjWsOC1kJGfJmRlOZzBD6kIlJaMIjkJdyj/s1600/direct_approx_r8.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5quZydBFVOMjfuJlp1cnc5-HpLeQCN-Wmn_QVNy3HKGeOxJmfLh4dZ_v0JKTelGzNisy8SU5Wyp8tjylaP_dlWNTTtuBqABWCXPsQ_hOdWfL-XZhEKRgs9BZ5hTQrztfPUYK7XG48R4-F/s1600/direct_approx_r16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5quZydBFVOMjfuJlp1cnc5-HpLeQCN-Wmn_QVNy3HKGeOxJmfLh4dZ_v0JKTelGzNisy8SU5Wyp8tjylaP_dlWNTTtuBqABWCXPsQ_hOdWfL-XZhEKRgs9BZ5hTQrztfPUYK7XG48R4-F/s1600/direct_approx_r16.png" height="106" width="103" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="4"><div style="text-align: left;">
<div style="text-align: left;">
Lit with a single directional light<br />
From left to right: <i>r </i>= 2, 4, 8, 16</div>
</div>
<div style="text-align: left;">
<div style="text-align: left;">
Upper row: shaded with look up texture</div>
</div>
<div style="text-align: left;">
<div style="text-align: left;">
Lower row: shaded with approximated function</div>
</div>
</td></tr>
</tbody></table>
<br />
This is how it looks when applied to a human head model:<br />
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnRavH8j8rxFOKFU2xnhY5hTM2szT_pfSJNSBgN3vuxLBdVu_t1Tu8XaiPayH_fRFBr0IDWYPEZoqVsgf2TsIuCtuq-P3x0lENDa48AH8_akF53EUk7JDAR7Ks0-I0gjEa7pmbQ3p7xSGY/s1600/direct_shading_exact.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnRavH8j8rxFOKFU2xnhY5hTM2szT_pfSJNSBgN3vuxLBdVu_t1Tu8XaiPayH_fRFBr0IDWYPEZoqVsgf2TsIuCtuq-P3x0lENDa48AH8_akF53EUk7JDAR7Ks0-I0gjEa7pmbQ3p7xSGY/s1600/direct_shading_exact.png" height="167" width="200" /></a>
</td>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlIsRO8RnHq6Mq87OYRRss0Eezz4jl9lYDIXfPu24gP2D37SCOO2R5dHtDphl-24oPFc6MXc5f8GInCfydgMhkU7dKBdbUkXughgM6i4MTEMjp7Y2RMWyaQkXoQ0njkD3KbqVgxhemdH2K/s1600/direct_shading_approx.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlIsRO8RnHq6Mq87OYRRss0Eezz4jl9lYDIXfPu24gP2D37SCOO2R5dHtDphl-24oPFc6MXc5f8GInCfydgMhkU7dKBdbUkXughgM6i4MTEMjp7Y2RMWyaQkXoQ0njkD3KbqVgxhemdH2K/s1600/direct_shading_approx.png" height="167" width="200" /></a>
</td>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcSHb8WO-HjTSKhyP_CMTWldHaQci4q6hZaMr9KYumSRtL0j1r5_x7Hlc8j_IfZ3id-0o1u0kFsDCm8nXzMoPNxJJDN5RWylYFGk5jwYmOeUVqmYZ6GJ-o29EnohdV-_M4DTxZxbtceqlX/s1600/direct_shading_lambert.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcSHb8WO-HjTSKhyP_CMTWldHaQci4q6hZaMr9KYumSRtL0j1r5_x7Hlc8j_IfZ3id-0o1u0kFsDCm8nXzMoPNxJJDN5RWylYFGk5jwYmOeUVqmYZ6GJ-o29EnohdV-_M4DTxZxbtceqlX/s1600/direct_shading_lambert.png" height="167" width="200" /></a>
</td>
</tr>
<tr>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiF6iF4YI8h7TYd3y2eblINrJV9g99e_xblb33kvs6YwHNhDUXsjT0yR_Kpo0ZQ0LbWLw2EM6jIZgvymzzIyYWS_NHmur8slifxlWx6-sQNz4DweeYb0ug78OOB_ujYthQfpxyCjtJDA3dw/s1600/direct_lighting_exact.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiF6iF4YI8h7TYd3y2eblINrJV9g99e_xblb33kvs6YwHNhDUXsjT0yR_Kpo0ZQ0LbWLw2EM6jIZgvymzzIyYWS_NHmur8slifxlWx6-sQNz4DweeYb0ug78OOB_ujYthQfpxyCjtJDA3dw/s1600/direct_lighting_exact.png" height="167" width="200" /></a>
</td>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQEUyUQJkU9qICkF6e_xJIA1yEGSzyrgADwpnsc7adTpdHJ-KC7_7eB6oet0P4-ASXROp82_MEV_Z8W-7uE5ITDQ6aMS4ijKHeRcOoZDjZpsfiCuuPasjaubIYELik-njrviHsLRdosUjp/s1600/direct_lighting_approx.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQEUyUQJkU9qICkF6e_xJIA1yEGSzyrgADwpnsc7adTpdHJ-KC7_7eB6oet0P4-ASXROp82_MEV_Z8W-7uE5ITDQ6aMS4ijKHeRcOoZDjZpsfiCuuPasjaubIYELik-njrviHsLRdosUjp/s1600/direct_lighting_approx.png" height="167" width="200" /></a>
</td>
<td height="167" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIdu4gQSKQ_1-clLF3K_mEcgDUrHKH72W4OGBxuQJA5i1x46PfZgPpLnoneVSKyiVS4RoWDAxJUi6NUR2DfmYeBT_-3wAZ5eL76op0a3X8cBqCkiG92vRs8ZxhNIoRiEBb5Jq6MCsfirD6/s1600/direct_lighting_lambert.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIdu4gQSKQ_1-clLF3K_mEcgDUrHKH72W4OGBxuQJA5i1x46PfZgPpLnoneVSKyiVS4RoWDAxJUi6NUR2DfmYeBT_-3wAZ5eL76op0a3X8cBqCkiG92vRs8ZxhNIoRiEBb5Jq6MCsfirD6/s1600/direct_lighting_lambert.png" height="167" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><div style="text-align: left;">
<div style="text-align: left;">
<span class="Apple-style-span">From left to right: shaded with look up texture, approximated </span>function<span class="Apple-style-span">, lambert shader</span></div>
</div>
<div style="text-align: left;">
<div style="text-align: left;">
Upper row: shaded with albedo texture applied</div>
</div>
<div style="text-align: left;">
<div style="text-align: left;">
Lower row: showing only lighting result</div>
</div>
</td></tr>
</tbody></table>
<br />
For your reference, here is the approximated function I used for the RGB channels:<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">NdotL = mad(NdotL, 0.5, 0.5); // map to 0 to 1 range<br />float curva = (1.0/mad(curvature, 0.5 - 0.0625, 0.0625) - 2.0) / (16.0 - 2.0); // curvature is within [0, 1] remap to normalized <i>r</i> from 2 to 16<br />float oneMinusCurva = 1.0 - curva;<br />float3 curve0;<br />{<br /> float3 rangeMin = float3(0.0, 0.3, 0.3);<br /> float3 rangeMax = float3(1.0, 0.7, 0.7);<br /> float3 offset = float3(0.0, 0.06, 0.06);<br /> float3 t = saturate( mad(NdotL, 1.0 / (rangeMax - rangeMin), (offset + rangeMin) / (rangeMin - rangeMax) ) );<br /> float3 lowerLine = (t * t) * float3(0.65, 0.5, 0.9);<br /> lowerLine.r += 0.045;<br /> lowerLine.b *= t.b;<br /> float3 m = float3(1.75, 2.0, 1.97);<br /> float3 upperLine = mad(NdotL, m, float3(0.99, 0.99, 0.99) -m );<br /> upperLine = saturate(upperLine);<br /> float3 lerpMin = float3(0.0, 0.35, 0.35);<br /> float3 lerpMax = float3(1.0, 0.7 , 0.6 );<br /> float3 lerpT = saturate( mad(NdotL, 1.0/(lerpMax-lerpMin), lerpMin/ (lerpMin - lerpMax) ));<br /> curve0 = lerp(lowerLine, upperLine, lerpT * lerpT);<br />}<br />float3 curve1;<br />{<br /> float3 m = float3(1.95, 2.0, 2.0);<br /> float3 upperLine = mad( NdotL, m, float3(0.99, 0.99, 1.0) - m);<br /> curve1 = saturate(upperLine);<br />}<br />float oneMinusCurva2 = oneMinusCurva * oneMinusCurva;<br />float3 brdf = lerp(curve0, curve1, mad(oneMinusCurva2, -1.0 * oneMinusCurva2, 1.0) );</span></blockquote>
<span class="Apple-style-span" style="font-size: large; font-weight: bold;"><br /></span>
<span class="Apple-style-span" style="font-size: large; font-weight: bold;">Curve Approximation for Indirect Lighting</span><br />
In my engine, the indirect lighting is stored in spherical harmonics up to order 2. So the pre-integrated skin BRDF needs to be projected into spherical harmonics coefficients, which can be stored in a look up texture just like <a href="http://blog.selfshadow.com/publications/s2013-shading-course/rad/s2013_pbs_rad_notes.pdf">the presentation of "The Order: 1886"</a> described. One thing to note: the integral range in equation (19) from the paper should go up to π instead of π/2, because to project the coefficients we need to integrate over the whole sphere domain, and the value of D(<i>θ</i>, <i>r</i>) in the lower hemisphere may be non-zero for small <i>r</i> due to sub-surface scattering, unlike the clamped cos(θ). So I compute the spherical harmonics coefficients using this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-X-FbqRivLaQqD8SBTkfBypil6H_ngGe0mr8N_iYLYuJi72m2uiBb__XIgpx3dZh6x1gfsamNezNg1Zda_c-ntMBD0A5SG-YO30X6pPal9H5eCFZUmGpf00EC_TI07wVdEmGFJPRli0dq/s1600/forumla_SH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-X-FbqRivLaQqD8SBTkfBypil6H_ngGe0mr8N_iYLYuJi72m2uiBb__XIgpx3dZh6x1gfsamNezNg1Zda_c-ntMBD0A5SG-YO30X6pPal9H5eCFZUmGpf00EC_TI07wVdEmGFJPRli0dq/s1600/forumla_SH.png" height="40" width="320" /></a></div>
To make the indirect lighting work on my target platform, an approximate function for this indirect lighting look up texture also needs to be found. Using similar methods to those described above, and with some trial and error, here is my result:<br />
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhW9s7kJYSwmxfrb1yGHyjvDiEc33j5Y-svab_troQQ21oXSic388NPXh8i2F-Qo5APXoPJRzKBzQw4kFYu6l5GTeXBUYuW9yaOkYkXYLk15oOqRhUHwNrf4Y-3Mgxps1mNmqj9Vr9Sggy4/s1600/sh_exact_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhW9s7kJYSwmxfrb1yGHyjvDiEc33j5Y-svab_troQQ21oXSic388NPXh8i2F-Qo5APXoPJRzKBzQw4kFYu6l5GTeXBUYuW9yaOkYkXYLk15oOqRhUHwNrf4Y-3Mgxps1mNmqj9Vr9Sggy4/s1600/sh_exact_r2.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKR94gMsiNjwpMCauk_S0FepW7rrEAwDdDnZutnRB6mI-nDFO9GE_0p7ZPaKexpBjjGj7imssRWGfhWwp1MfQ_O1xS8TaDGdE7frjXz2rgu3i9uxVjiKvRW_NHBPAx0w5BNIK0RjU_kwtI/s1600/sh_exact_r4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKR94gMsiNjwpMCauk_S0FepW7rrEAwDdDnZutnRB6mI-nDFO9GE_0p7ZPaKexpBjjGj7imssRWGfhWwp1MfQ_O1xS8TaDGdE7frjXz2rgu3i9uxVjiKvRW_NHBPAx0w5BNIK0RjU_kwtI/s1600/sh_exact_r4.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipo8TzEmPG6bNjhsfHY0GQLqXeLqNmmVfzlB0m_KFbJ6YfC-fvH2KqJXhGfJ1wgP6iCZLNsPRDKE71fU-U0X2hxsyROSmfd2iGnxZrs-hOzDU_vfvE4jbH4doUmHAcCFfZbIe_mlV2IZiu/s1600/sh_exact_r8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipo8TzEmPG6bNjhsfHY0GQLqXeLqNmmVfzlB0m_KFbJ6YfC-fvH2KqJXhGfJ1wgP6iCZLNsPRDKE71fU-U0X2hxsyROSmfd2iGnxZrs-hOzDU_vfvE4jbH4doUmHAcCFfZbIe_mlV2IZiu/s1600/sh_exact_r8.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVfiQuzaEY_uBhDj8XkciA0RefEM95S9QmlGRPePB531zPHO3COkVR_lEAMF5TyzWRKQeTX_7J7ip6yvw7YT93hlbtH9ZlGVwVmLudZGh1vzqX5swYzmqgVf68eEkmqhb0CGMyCROK-AzZ/s1600/sh_exact_r16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVfiQuzaEY_uBhDj8XkciA0RefEM95S9QmlGRPePB531zPHO3COkVR_lEAMF5TyzWRKQeTX_7J7ip6yvw7YT93hlbtH9ZlGVwVmLudZGh1vzqX5swYzmqgVf68eEkmqhb0CGMyCROK-AzZ/s1600/sh_exact_r16.png" height="106" width="103" /></a>
</td>
</tr>
<tr>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCWAdTLaY2U2RYzyW7sVssDWg3FuGFD2RmOE9BLzqSOtm550YsBLlCmmvv9BrIwZtZ-2Y44K0cQckTDZ2mvNp73wmUmbX_lyKxECEwaB6XOs_EU5V576Hwbe8TL5lvuN-LxEoWNeDbyhFa/s1600/sh_approx_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCWAdTLaY2U2RYzyW7sVssDWg3FuGFD2RmOE9BLzqSOtm550YsBLlCmmvv9BrIwZtZ-2Y44K0cQckTDZ2mvNp73wmUmbX_lyKxECEwaB6XOs_EU5V576Hwbe8TL5lvuN-LxEoWNeDbyhFa/s1600/sh_approx_r2.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHrlZPTqgrnksb8SWN5Rl2JizVJpRwaqzWx1_NNXITeJhzv3quCraGy1Z2vCCASvnehvvmtrrFb2g7Bp_5XMjdmv7H3W1I1LmUnLAGguMYUDzs2oTeHQujRYzozXMZ1KRPO-3jo0unooNQ/s1600/sh_approx_r4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHrlZPTqgrnksb8SWN5Rl2JizVJpRwaqzWx1_NNXITeJhzv3quCraGy1Z2vCCASvnehvvmtrrFb2g7Bp_5XMjdmv7H3W1I1LmUnLAGguMYUDzs2oTeHQujRYzozXMZ1KRPO-3jo0unooNQ/s1600/sh_approx_r4.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSprl7gOElkXvaoeAZgqiX1aRZz6ubZApgXfuEWvcTGev0nKsITYq4yNyL8hHZwbRc7WNX_IhIiCrrd8gmZWk9ChZRd3F0UjBG0_juH2rxlsDAXZxhuUyiHGd4uOaIBC37OXgVRKOGdAo-/s1600/sh_approx_r8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSprl7gOElkXvaoeAZgqiX1aRZz6ubZApgXfuEWvcTGev0nKsITYq4yNyL8hHZwbRc7WNX_IhIiCrrd8gmZWk9ChZRd3F0UjBG0_juH2rxlsDAXZxhuUyiHGd4uOaIBC37OXgVRKOGdAo-/s1600/sh_approx_r8.png" height="106" width="103" /></a>
</td>
<td height="106" width="103"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhffNbN7RWbqm8_KN1JhibkcmI-76aLV4E_NQgShdc74P4jpNZWwqt5viqkPETKaXR1wqrCGSsnhR5dqD22ckdVzOp4yFTHwLY5NrCP09U__Ss1fEnh1_YMasmhyb7GSvIA_r8z4vh3Q4Km/s1600/sh_approx_r16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhffNbN7RWbqm8_KN1JhibkcmI-76aLV4E_NQgShdc74P4jpNZWwqt5viqkPETKaXR1wqrCGSsnhR5dqD22ckdVzOp4yFTHwLY5NrCP09U__Ss1fEnh1_YMasmhyb7GSvIA_r8z4vh3Q4Km/s1600/sh_approx_r16.png" height="106" width="103" /></a></td></tr>
<tr><td align="center" colspan="4"><div style="text-align: left;">
Lit with both a directional light and indirect lighting via the BRDF projected into SH<br />
From left to right: <i>r</i> = 2, 4, 8, 16</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Upper row: shaded with look up texture</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Lower row: shaded with approximated function</div>
</td>
</tr>
</tbody></table>
<br />
And applying it to the human head model, this time the approximation is not as close and loses a bit of red color:<br />
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Etqh5zNEUvRBb2G4Ga2h9GeQK-NIrz06Bc8afqvuLnjZ0YIfXdTUhFfSgF4BBLoDfZIsMZ1K2mu6PCT2ekYQgOSlBE-GAr-kKQMyo_PVgkDPJrcGOpE_3VsRwRqqTPwQoCYWWiq8U3sg/s1600/SH_shading_exact.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Etqh5zNEUvRBb2G4Ga2h9GeQK-NIrz06Bc8afqvuLnjZ0YIfXdTUhFfSgF4BBLoDfZIsMZ1K2mu6PCT2ekYQgOSlBE-GAr-kKQMyo_PVgkDPJrcGOpE_3VsRwRqqTPwQoCYWWiq8U3sg/s1600/SH_shading_exact.png" height="177" width="200" /></a>
</td>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmAd_Tnu3fHCSsoehQY5t3HWTWV3Hh1N_dKCBeyGCQ-9oHaCTvOk8FyhfnlQY5WmAAlkMzW_tnEjVh8o_LMnLEsqb0Q1Y2DcyOJY8DQspjimswQ-lAiIcaD6BOqbZ5eBJ4vc_Bs72Aimg_/s1600/SH_shading_approx.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmAd_Tnu3fHCSsoehQY5t3HWTWV3Hh1N_dKCBeyGCQ-9oHaCTvOk8FyhfnlQY5WmAAlkMzW_tnEjVh8o_LMnLEsqb0Q1Y2DcyOJY8DQspjimswQ-lAiIcaD6BOqbZ5eBJ4vc_Bs72Aimg_/s1600/SH_shading_approx.png" height="177" width="200" /></a>
</td>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC0t9UGpc_OZBpQL9ejV-xJBfEWXsgXLB8iLEaZFAMAAEF_nnpF-Kg-bwrJ-YHQeflEemgi21H7P583Cri2gSgC2UBdrWMhxVJnm2vcI0rJHCthHkv3jTY-KbPVJJJeJ5sRf-Zvqly8I5-/s1600/SH_shading_lambert.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC0t9UGpc_OZBpQL9ejV-xJBfEWXsgXLB8iLEaZFAMAAEF_nnpF-Kg-bwrJ-YHQeflEemgi21H7P583Cri2gSgC2UBdrWMhxVJnm2vcI0rJHCthHkv3jTY-KbPVJJJeJ5sRf-Zvqly8I5-/s1600/SH_shading_lambert.png" height="177" width="200" /></a>
</td>
</tr>
<tr>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ-ImZ4LPSF-JAe4IChzCP6ovgTiBLJJKfV7OgyVSb5rMtvpQ86gvYM5osAu74rsAVGWSaB4BDbGXbZHfCVNPBVw57Faa5-8i9gTCDBxUEXIBGcc-tmctgLrC5VxWzAVd-Lss9pMnDNzDl/s1600/lighting_exact.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ-ImZ4LPSF-JAe4IChzCP6ovgTiBLJJKfV7OgyVSb5rMtvpQ86gvYM5osAu74rsAVGWSaB4BDbGXbZHfCVNPBVw57Faa5-8i9gTCDBxUEXIBGcc-tmctgLrC5VxWzAVd-Lss9pMnDNzDl/s1600/lighting_exact.png" height="177" width="200" /></a>
</td>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaig9BFbIOJ6z1tWJjvbtJfq3rT4UZXRKdgxGBKvtYsOU5c2-YjnoJVeDomBok3Nim-5nXCceeTMQc4r7rTOpPMYjfIyorFQUd5r17qh3s62j0FyLWB9ZiQiHex32AKiUFdi9JpLD8T_VB/s1600/SH_lighting_approx.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaig9BFbIOJ6z1tWJjvbtJfq3rT4UZXRKdgxGBKvtYsOU5c2-YjnoJVeDomBok3Nim-5nXCceeTMQc4r7rTOpPMYjfIyorFQUd5r17qh3s62j0FyLWB9ZiQiHex32AKiUFdi9JpLD8T_VB/s1600/SH_lighting_approx.png" height="177" width="200" /></a>
</td>
<td height="177" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW7z9o57yGHmUfW1-2txz1Habkl22EwDlOG7fmHbobqEF2jFDccB9mNo17mA0FxVf0Sm5AqIlHPdzsUEVBvxQ_dO3gLEpOL7AUNsa_2cKtPqnl1KsPiBRuLTObLpIZvHUNm8aMThXbMVT7/s1600/SH_lighting_lambert.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW7z9o57yGHmUfW1-2txz1Habkl22EwDlOG7fmHbobqEF2jFDccB9mNo17mA0FxVf0Sm5AqIlHPdzsUEVBvxQ_dO3gLEpOL7AUNsa_2cKtPqnl1KsPiBRuLTObLpIZvHUNm8aMThXbMVT7/s1600/SH_lighting_lambert.png" height="177" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><div style="text-align: left;">
<span class="Apple-style-span">From left to right: shaded with look up texture, approximated </span>function<span class="Apple-style-span">, lambert shader</span></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Upper row: shaded with albedo texture applied</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Lower row: showing only lighting result</div>
</td></tr>
</tbody></table>
<br />
And some code for your reference, where zh0 and zh1 are the zonal harmonics coefficients:<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">float curva = (1.0/mad(curvature, 0.5 - 0.0625, 0.0625) - 2.0) / (16.0 - 2.0); // curvature is within [0, 1] remap to <i>r</i> distance 2 to 16<br />float oneMinusCurva = 1.0 - curva;<br />// ZH0<br />{<br /> float2 remappedCurva = 1.0 - saturate(curva * float2(3.0, 2.7) );<br /> remappedCurva *= remappedCurva;<br /> remappedCurva *= remappedCurva;<br /> float3 multiplier = float3(1.0/mad(curva, 3.2, 0.4), remappedCurva.x, remappedCurva.y);<br /> zh0 = mad(multiplier, float3( 0.061659, 0.00991683, 0.003783), float3(0.868938, 0.885506, 0.885400));<br />}<br />// ZH1<br />{<br /> float remappedCurva = 1.0 - saturate(curva * 2.7);</span><span class="Apple-style-span" style="font-size: xx-small;"> </span><span class="Apple-style-span" style="font-size: xx-small;"><br /> float3 lowerLine = mad(float3(0.197573092, 0.0117447875, 0.0040980375), (1.0f - remappedCurva * remappedCurva * remappedCurva), float3(0.7672169, 1.009236, 1.017741));<br /> float3 upperLine = float3(1.018366, 1.022107, 1.022232);<br /> zh1 = lerp(upperLine, lowerLine, oneMinusCurva * oneMinusCurva);<br />}</span></blockquote>
<b><span class="Apple-style-span" style="font-size: large;">Result</span></b><br />
Putting both the direct and indirect lighting calculations together, here is the result with a simple GGX specular, lit by 1 directional light, SH projected indirect lighting and a pre-filtered IBL.<br />
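In shader terms, the combination might look like the sketch below; treating zh0/zh1 as per-band multipliers on the SH-evaluated indirect lighting is my assumption about how they are applied, and all names are placeholders (specular and IBL omitted):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">float3 SkinLighting(float3 N, float3 lightColor, float3 brdf, float4 shR, float4 shG, float4 shB, float3 zh0, float3 zh1)<br />{<br /> float3 direct = lightColor * brdf; // brdf from the direct lighting fit<br /> float4 basis = float4(0.282095, 0.488603 * N.y, 0.488603 * N.z, 0.488603 * N.x);<br /> float3 indirect;<br /> indirect.r = dot(shR * float4(zh0.r, zh1.rrr), basis);<br /> indirect.g = dot(shG * float4(zh0.g, zh1.ggg), basis);<br /> indirect.b = dot(shB * float4(zh0.b, zh1.bbb), basis);<br /> return direct + max(indirect, 0.0);<br />}</span></blockquote>
<br />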
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzGII6bjBXGVTGA0qbTYcYRocWpHVmcBLlgFe3re6Lu_1EArOzvw3Eh9cqcmXtlKhYXXqrWEOwP4u1lP4-KU_9xDTFKszE2A_sG5jdJjzYkQfv5lJzey1mkh2j3UgVrUXIUE-XozwF9-1m/s1600/scene0_shading.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzGII6bjBXGVTGA0qbTYcYRocWpHVmcBLlgFe3re6Lu_1EArOzvw3Eh9cqcmXtlKhYXXqrWEOwP4u1lP4-KU_9xDTFKszE2A_sG5jdJjzYkQfv5lJzey1mkh2j3UgVrUXIUE-XozwF9-1m/s1600/scene0_shading.png" height="112" width="200" /></a>
</td>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBqGJN6gZmtkflZfBKEx4qgEM7HsXEe4bmnFpBbuqt5gLgDFstHCDkW4Hp4P-87oMGkWre_aZLGN-2ApluTy9MnOcP5QI9LZcRWWuKR3NAyk4EGbSDFEMr-xThxI7rYJNEiP5N8i6-O2c/s1600/scene0_shading_direct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBqGJN6gZmtkflZfBKEx4qgEM7HsXEe4bmnFpBbuqt5gLgDFstHCDkW4Hp4P-87oMGkWre_aZLGN-2ApluTy9MnOcP5QI9LZcRWWuKR3NAyk4EGbSDFEMr-xThxI7rYJNEiP5N8i6-O2c/s1600/scene0_shading_direct.png" height="113" width="200" /></a>
</td>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwgODdAiSG5dkayQl2gEeGgfsmsvtKUmjBWcTfiSaTlWe-Qq7O4valddD8lTpX725Xdj1MgMbDX07r3sh193112MoVd6Qq3PRHckCw8ZvCEcYJT8NUOSofQsR8i9tD2ZWp21MlNAjdODGe/s1600/scene0_shading_SH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwgODdAiSG5dkayQl2gEeGgfsmsvtKUmjBWcTfiSaTlWe-Qq7O4valddD8lTpX725Xdj1MgMbDX07r3sh193112MoVd6Qq3PRHckCw8ZvCEcYJT8NUOSofQsR8i9tD2ZWp21MlNAjdODGe/s1600/scene0_shading_SH.png" height="113" width="200" /></a>
</td>
</tr>
<tr>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg84pOOXnepUvjlvUU5VLtkVZSFRxy8AOhTWxhkjsPbXANDLPaKJsnFp5_VlEauQ9wTiFZ7bIVOoBy0u9ZRrB_0ILIYeky9JOD7NyL8fh5zqYwmNHr_LFwtXekHmj5Kt6SMNC4NcR7Yi14L/s1600/scene0_lighting.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg84pOOXnepUvjlvUU5VLtkVZSFRxy8AOhTWxhkjsPbXANDLPaKJsnFp5_VlEauQ9wTiFZ7bIVOoBy0u9ZRrB_0ILIYeky9JOD7NyL8fh5zqYwmNHr_LFwtXekHmj5Kt6SMNC4NcR7Yi14L/s1600/scene0_lighting.png" height="113" width="200" /></a>
</td>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggh278JKYynmGI2YSBTeR_YqD9EsyWy7wFYU-nZV-ihDkNLHKoSf9XRufs1TuAZCTCroTyqviX1rfGmxsYqT6qv5eV8luWuKBanBI6arY-3DIVB6LjEeLOrZBr0dqUXRwxdAnH-RrNKOzs/s1600/scene0_lighting_direct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggh278JKYynmGI2YSBTeR_YqD9EsyWy7wFYU-nZV-ihDkNLHKoSf9XRufs1TuAZCTCroTyqviX1rfGmxsYqT6qv5eV8luWuKBanBI6arY-3DIVB6LjEeLOrZBr0dqUXRwxdAnH-RrNKOzs/s1600/scene0_lighting_direct.png" height="113" width="200" /></a>
</td>
<td height="113" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbTzgUZHt-DYD-4ReTDnVO9R8fTDZpGqg-pLK_UA8DniA-YjsRMK6ooWPMu9xNl3Ysa8O-yd7lQ8yMeDKMaU58NLhnCgQVK5epWzHRnN3Woys80jO3HQkN-QDfBEAkELlCKPU4tk7Y0P5/s1600/scene0_lighting_SH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbTzgUZHt-DYD-4ReTDnVO9R8fTDZpGqg-pLK_UA8DniA-YjsRMK6ooWPMu9xNl3Ysa8O-yd7lQ8yMeDKMaU58NLhnCgQVK5epWzHRnN3Woys80jO3HQkN-QDfBEAkELlCKPU4tk7Y0P5/s1600/scene0_lighting_SH.png" height="113" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><div style="text-align: left;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span">Heads from left to right: shaded with lambert shader, approximated </span>function<span class="Apple-style-span">, look up texture</span></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
</div>
Images from left to right: full shading, direct lighting only, indirect lighting only</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Upper row: shaded with albedo texture applied</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Lower row: showing only lighting result</div>
</td></tr>
</tbody></table>
<br />
With another lighting environment:<br />
<table align="center" border="0" cellspacing="0">
<tbody>
<tr>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEir17PH5cySwmrr8LpiAs-eRZl1VeYfwzFYmOIQLbIo7WmkCr17rH1QpK9dG4PZri6NlKMqiXIcTeL2RFIoIuB1DoCJwARLOWabrYNYQ3k4FrqWa3wRy7a0dyWLk5M8bCaC7161SLo1g4yE/s1600/scene1_shading.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEir17PH5cySwmrr8LpiAs-eRZl1VeYfwzFYmOIQLbIo7WmkCr17rH1QpK9dG4PZri6NlKMqiXIcTeL2RFIoIuB1DoCJwARLOWabrYNYQ3k4FrqWa3wRy7a0dyWLk5M8bCaC7161SLo1g4yE/s1600/scene1_shading.png" height="85" width="200" /></a>
</td>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMODyLDWLc1UxJZJF_EzBDQmXxE7mym2oYLhDBAI8DgmfqrBdo0WodFMbTY48LGlDdQVSwAm6eDwSZf2LO_fwnjZ6jWr1T3TOyXMGLgHfqWOsPz3GN9sM8PAIJQF4iAUJNT1CAnxV4P2Ui/s1600/scene1_shading_direct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMODyLDWLc1UxJZJF_EzBDQmXxE7mym2oYLhDBAI8DgmfqrBdo0WodFMbTY48LGlDdQVSwAm6eDwSZf2LO_fwnjZ6jWr1T3TOyXMGLgHfqWOsPz3GN9sM8PAIJQF4iAUJNT1CAnxV4P2Ui/s1600/scene1_shading_direct.png" height="85" width="200" /></a>
</td>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLvJbww7GOF8eZVhZygqQA-d9O8LhvH0NgwCnDZxJlgEqjFhjI_V1rOkXhSdFqk1h2GOkF_MSOrpQc9GZVvLu1GZU12ypt1gZdJnKD5V0zeET6z6o4TUUHDdDIhTAw_TkKyoOpFtHJUYCA/s1600/scene1_shading_SH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLvJbww7GOF8eZVhZygqQA-d9O8LhvH0NgwCnDZxJlgEqjFhjI_V1rOkXhSdFqk1h2GOkF_MSOrpQc9GZVvLu1GZU12ypt1gZdJnKD5V0zeET6z6o4TUUHDdDIhTAw_TkKyoOpFtHJUYCA/s1600/scene1_shading_SH.png" height="85" width="200" /></a>
</td>
</tr>
<tr>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEcBTP1ezdrmDc02mNqIudC6vXQQJ2dcy-UQHPWXedxSsjK_cVZv7JCrtWJURwsh8ND1UosotepALH33gtQWSFdLM7-mNGlnOBEDC7PsygEn65fD6hyhB9z4GBYYdgvWfe65PCZL5Voglb/s1600/scene1_lighting.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEcBTP1ezdrmDc02mNqIudC6vXQQJ2dcy-UQHPWXedxSsjK_cVZv7JCrtWJURwsh8ND1UosotepALH33gtQWSFdLM7-mNGlnOBEDC7PsygEn65fD6hyhB9z4GBYYdgvWfe65PCZL5Voglb/s1600/scene1_lighting.png" height="85" width="200" /></a>
</td>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKiXe5Xt0NgyZJD_Vps-nX5TEAK1qb21v1TUwV9rvLTrDK1Yl1xAC6tlhVBpNPn8m7HcMk7_caIjhodvT8vQ576TOof4P64kiTMyUgQKryFKfssUA0ogbZ1y6_mSCdqWsQUZfOVMJDG_VP/s1600/scene1_lighting_direct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKiXe5Xt0NgyZJD_Vps-nX5TEAK1qb21v1TUwV9rvLTrDK1Yl1xAC6tlhVBpNPn8m7HcMk7_caIjhodvT8vQ576TOof4P64kiTMyUgQKryFKfssUA0ogbZ1y6_mSCdqWsQUZfOVMJDG_VP/s1600/scene1_lighting_direct.png" height="85" width="200" /></a>
</td>
<td height="85" width="200"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioU4sQkGFcnTo-5UKwSo9-J5a9Tb0naiNZRCtywi3BbvXjZsEURheRdP_oYDVMaQ0OTnQflBFHdrin5DwD0gTJRFQqkXv39hlyyAb06BoC9ydrfukkid0dS4yEMJ6xouxJ0UyMaTVLRMKw/s1600/scene1_lighting_SH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioU4sQkGFcnTo-5UKwSo9-J5a9Tb0naiNZRCtywi3BbvXjZsEURheRdP_oYDVMaQ0OTnQflBFHdrin5DwD0gTJRFQqkXv39hlyyAb06BoC9ydrfukkid0dS4yEMJ6xouxJ0UyMaTVLRMKw/s1600/scene1_lighting_SH.png" height="85" width="200" /></a>
</td>
</tr>
<tr>
<td align="center" colspan="3"><div style="text-align: left;">
Heads from left to right: shaded with look up texture, approximated function<span class="Apple-style-span">, lambert shader</span></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
<div style="text-align: left;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Images from left to right: full shading, direct lighting only, indirect lighting only</div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
</div>
Upper row: shaded with albedo texture applied</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;">
Lower row: showing only lighting result</div>
</td></tr>
</tbody></table>
<br />
I have uploaded the <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio3VS8w6hPQ-qsr2Uupzgs3yybesq9TcN-zML33tjCmj-QWkn82KoazJ8KGthwJwS4UiwOSZ-35QNOGn7d5TEeGDm3uo3C4qD-W0GCkP9XY8K_V-YqI22_-c4De292EtFhaQdxUjhelLYF/s1600/head_curvature.png">curvature map for the human head here</a>, the <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjpipX9-frF9HZfjg9ZT7O6Ax3NEJItAoygkjCGKuQd_81Tba9QumVf7X8iX2x0PWqmLi8inWnLr6GAHqxm9M8fC1CBfOhXLiZMnN6BZKQ8rdIH3TT_YPELlYdVZt5xrnFSvhgwsV9coSe/s1600/lut_direct.png">look up texture for the direct lighting here</a> and the <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLKQ8o4QJTN6W2AxjWZgHLz9xoU0bXlzJcwrUQTuVJmu811PdNCuBs3a-ACcBVPhWQkRZrFLu7Cq6x36xLW_kp93tNq4xw1-5jhzOnz4bh19c__q24L8f_MEZy9KxKXZNqm3iAetLfFyxU/s1600/lut_SH.png">indirect lighting look up texture here</a>. The textures need to be loaded without sRGB conversion. For the indirect lighting texture, I have scaled the values so that they fit into an 8-bit texture within the 0 to 1 range. So a sample use of the look up textures looks like:<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">float3 brdf= directBRDF.Sample(samplerLinearClamp, float2(mad(NdotL, 0.5, 0.5), oneOverR)).rgb;<br />float3 zh0= indirectBRDF_ZH.Sample(samplerLinearClamp, float2(oneOverR, 0.25)).rgb;<br />float3 zh1= indirectBRDF_ZH.Sample(samplerLinearClamp, float2(oneOverR, 0.75)).rgb;<br />float remapMin= 0.75;<br />float remapMax= 1.05;<br />zh0= zh0 * (remapMax - remapMin) + remapMin;<br />zh1= zh1 * (remapMax - remapMin) + remapMin;</span></blockquote>
<b><span class="Apple-style-span" style="font-size: large;">Conclusion</span></b><br />
In this post, I described a way to find an approximated function for the pre-integrated skin diffusion profile, which gives a similar result to the look-up texture for the direct lighting while losing a bit of red color in the indirect lighting. The downside of fitting the curve manually is that whenever the function changes a bit, say changing the function input from radial distance to curvature (i.e. from <i>r</i> to 1/<i>r</i>), all the approximated functions need to be redone (or the conversion needs to be done at run-time, just like in my code snippet above...). Also, the shadow scattering described in the original paper is not implemented, so some artifacts may be seen at direct shadow boundaries. Overall, the skin shading result is improved compared to shading with Lambert or Oren-Nayar under an environment with a strong directional light source.<br />
<br />
<b>Reference</b><br />
<span class="Apple-style-span" style="font-size: x-small;">[1] SIGGRAPH 2011- Pre-Integrated Skin Shading</span><br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="http://advances.realtimerendering.com/s2011/Penner%20-%20Pre-Integrated%20Skin%20Rendering%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx">http://advances.realtimerendering.com/s2011/Penner%20-%20Pre-Integrated%20Skin%20Rendering%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[2] GPU Pro 2- </span><span class="Apple-style-span" style="font-size: x-small;">Pre-Integrated Skin Shading</span><span class="Apple-style-span" style="font-size: x-small;"> <a href="http://www.amazon.com/GPU-Pro-2-Wolfgang-Engel/dp/1568817185">http://www.amazon.com/GPU-Pro-2-Wolfgang-Engel/dp/1568817185</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[3] Crafting a Next-Gen Material Pipeline for The Order: 1886 <a href="http://blog.selfshadow.com/publications/s2013-shading-course/rad/s2013_pbs_rad_notes.pdf">http://blog.selfshadow.com/publications/s2013-shading-course/rad/s2013_pbs_rad_notes.pdf</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[4] GPU Gem 3- Advanced Techniques for Realistic Real-Time Skin Rendering </span><span class="Apple-style-span" style="font-size: x-small;"><a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html">http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[5] Mathematica and Skin Rendering <a href="http://c0de517e.blogspot.jp/2011/09/mathematica-and-skin-rendering.html">http://c0de517e.blogspot.jp/2011/09/mathematica-and-skin-rendering.html</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[6] Addendum to Mathematica and Skin Rendering <a href="http://c0de517e.blogspot.jp/2011/09/mathematica-and-skin-rendering.html">http://c0de517e.blogspot.jp/2011/09/mathematica-and-skin-rendering.html</a></span><br />
<span class="Apple-style-span" style="font-size: x-small;">[7] 3D head model by Infinite-Realities <a href="http://graphics.cs.williams.edu/data/meshes/head.zip">http://graphics.cs.williams.edu/data/meshes/head.zip</a></span><br />
<br />Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com10tag:blogger.com,1999:blog-7659461179709896430.post-57542720822775778792014-10-01T17:28:00.001+08:002015-02-08T03:19:40.359+08:00Recent Update 2014<b><span class="Apple-style-span" style="font-size: large;">Overview</span></b><br />
It has been a long time since my last post. Life has not been that good here in <a href="http://www.scmp.com/news/hong-kong/article/1603350/police-fire-tear-gas-and-baton-charge-thousands-occupy-central">Hong Kong</a>, but I have still been working on my engine in my spare time. This post shows some of the stuff I have been working on in the past few months.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMbkf4nlqvJ225nhP8O4qhyphenhyphen4AMJhBtgrCeFBuRncP4U5ux1F-gSlobfkdaYacqqcMihWV1UGq9U8EB5Jekfm78Vr7HOE4FCCunqgNT6RJofuXBek_cWBBad5aecfQAGBPXDYlRq0dfOIng/s1600/overview.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMbkf4nlqvJ225nhP8O4qhyphenhyphen4AMJhBtgrCeFBuRncP4U5ux1F-gSlobfkdaYacqqcMihWV1UGq9U8EB5Jekfm78Vr7HOE4FCCunqgNT6RJofuXBek_cWBBad5aecfQAGBPXDYlRq0dfOIng/s1600/overview.png" height="328" width="640" /></a></div>
<br />
<b><span class="Apple-style-span" style="font-size: large;">Shading</span></b><br />
On the graphics side, the engine has switched to physically based shaders, using the GGX model for specular and the Oren-Nayar model for diffuse shading. The GGX model is implemented according to <a href="http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf">this paper from Unreal Engine 4[1]</a>, while the Oren-Nayar model is implemented according to <a href="http://research.tri-ace.com/Data/s2012_beyond_CourseNotes.pdf">this paper from tri-Ace[2]</a>.<br />
<br />
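For illustration, here is a minimal HLSL sketch of these terms as published in [1] (a sketch of the standard formulas, not my engine's exact shader code; <i>NdotH</i>, <i>NdotV</i>, <i>NdotL</i>, <i>VdotH</i>, <i>roughness</i> and <i>specColor</i> are assumed to be computed beforehand):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// GGX / Trowbridge-Reitz NDF with alpha = roughness^2, as in the UE4 notes[1]<br />float a= roughness * roughness;<br />float a2= a * a;<br />float d= NdotH * NdotH * (a2 - 1.0) + 1.0;<br />float D= a2 / (3.14159265 * d * d);<br />// Schlick-GGX visibility with k = (roughness + 1)^2 / 8 for analytic lights<br />float k= (roughness + 1.0) * (roughness + 1.0) / 8.0;<br />float G= (NdotV / (NdotV * (1.0 - k) + k)) * (NdotL / (NdotL * (1.0 - k) + k));<br />// Schlick Fresnel approximation<br />float3 F= specColor + (1.0 - specColor) * pow(1.0 - VdotH, 5.0);<br />// Cook-Torrance microfacet specular, to be multiplied by NdotL and the light color<br />float3 specular= D * G * F * (1.0 / (4.0 * NdotL * NdotV));</span></blockquote>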
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZj83p7JUXm3evjCWlnTuQVoyC_Q5o5jivNz1zW8247JGbT1uPjjJOtERIX9W3J2SbY76v-pLi7L-VRaUzDkxKTROA_ChAloZUkIiQcl-T84uC4MqWDiwyAgtw4x7vsrxEn7ZTjEtdevsp/s1600/pbr.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZj83p7JUXm3evjCWlnTuQVoyC_Q5o5jivNz1zW8247JGbT1uPjjJOtERIX9W3J2SbY76v-pLi7L-VRaUzDkxKTROA_ChAloZUkIiQcl-T84uC4MqWDiwyAgtw4x7vsrxEn7ZTjEtdevsp/s1600/pbr.png" height="205" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Meshes shaded with physically based shaders with different roughness</td></tr>
</tbody></table>
The Oren-Nayar model is chosen over the Lambert model because it takes roughness into account. During shading, meshes with a roughness map but without a normal map show a bit more detail under the Oren-Nayar model.<br />
<br />
<table>
<tbody>
<tr>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcFhyvIvKz3KvP6VEkyVcCJO34ls8jfBHj0ldz8gb_0ajIHHhhs3FSQDGYCXE14UCnXqWOEDPwX2CgBX4EemAarBcSkbUHW-IpG1D5z69XDZJWvzf0YLI5CuLTk1oFXKMH2jmqX10TKLAN/s1600/rougness_001.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcFhyvIvKz3KvP6VEkyVcCJO34ls8jfBHj0ldz8gb_0ajIHHhhs3FSQDGYCXE14UCnXqWOEDPwX2CgBX4EemAarBcSkbUHW-IpG1D5z69XDZJWvzf0YLI5CuLTk1oFXKMH2jmqX10TKLAN/s1600/rougness_001.png" height="64" width="124" /></a></div>
</td>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUdcU1doG7hQU31yoIpr3zhS2yuKpUzkju483iVNWlyNly2xsmx3my0jjnE3riYJHvHTMzrFUTwnCgBL1M-Q-DKoVcfJnHhNkSp-tiSiiQn4zD3YMQv2UP5iqnN-bt5LsyTw_cCY8oygEW/s1600/rougness_002.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUdcU1doG7hQU31yoIpr3zhS2yuKpUzkju483iVNWlyNly2xsmx3my0jjnE3riYJHvHTMzrFUTwnCgBL1M-Q-DKoVcfJnHhNkSp-tiSiiQn4zD3YMQv2UP5iqnN-bt5LsyTw_cCY8oygEW/s1600/rougness_002.png" height="64" width="124" /></a></div>
</td>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRfAamByiMJyyQFK6h4Dv-xtZwrMPuoBs0b6Jfhu8muHc87Tv9Aoer67wOad_qxGr2FUsac60m31EQv0HR6GNLIYC6AvWCSNZyrjimf-HyRMUbOI6jSaxnwHTzRKGqzL4UCjI8Gr_OxbG1/s1600/rougness_003.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRfAamByiMJyyQFK6h4Dv-xtZwrMPuoBs0b6Jfhu8muHc87Tv9Aoer67wOad_qxGr2FUsac60m31EQv0HR6GNLIYC6AvWCSNZyrjimf-HyRMUbOI6jSaxnwHTzRKGqzL4UCjI8Gr_OxbG1/s1600/rougness_003.png" height="64" width="124" /></a></div>
</td>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2gpR-36Un-XcB1d53Czt-g6h2s673BDl3VGZU1SeYO0Vhcmj_PweBVwzJLODFqU6nl9btyvvjQZIgqZ0_HCAdQT6DsHGSo4b-D8qwXPE1vClMAr9p7MKVx9piVDzIWMjccwba-ygXDDqr/s1600/rougness_004.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2gpR-36Un-XcB1d53Czt-g6h2s673BDl3VGZU1SeYO0Vhcmj_PweBVwzJLODFqU6nl9btyvvjQZIgqZ0_HCAdQT6DsHGSo4b-D8qwXPE1vClMAr9p7MKVx9piVDzIWMjccwba-ygXDDqr/s1600/rougness_004.png" height="64" width="124" /></a></div>
</td>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTN_itXd0PSu1J-MlvZhZOE44odki9Qvl1lO3F4mtLuKpfjaJjuZ7TcMmHlo5gYa_IaqCmYE-0bjbgYnRTOcEld7yIyCBNpS3wr4J9H0WvRY35dCsWg8dGqmfPT7MplqSEYpxobz1qQo_A/s1600/rougness_005.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTN_itXd0PSu1J-MlvZhZOE44odki9Qvl1lO3F4mtLuKpfjaJjuZ7TcMmHlo5gYa_IaqCmYE-0bjbgYnRTOcEld7yIyCBNpS3wr4J9H0WvRY35dCsWg8dGqmfPT7MplqSEYpxobz1qQo_A/s1600/rougness_005.png" height="64" width="124" /></a></div>
</td>
</tr>
<tr>
<td colspan="5"><div style="text-align: center;">
A mesh shaded without a normal map under indirect lighting:</div>
<div style="text-align: center;">
From left to right: final result, lighting only, normal, roughness, albedo </div>
</td>
</tr>
</tbody></table>
<br />
Also, a <a href="http://en.wikipedia.org/wiki/Radiosity_(computer_graphics)">radiosity[3]</a> baking tool was written to calculate the indirect diffuse lighting for static meshes by rendering cube-maps at every light map texel position. The cube-maps are then projected into Spherical Harmonics (SH) during the radiosity iterations, and the results are stored as SH luma plus an average chroma over the hemisphere around the vertex normal (this chroma format doesn't play well with texture compression though, so another storage representation may be needed in the future...).<br />
<br />
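As a rough sketch of how this storage could be decoded at shading time (the variable names are hypothetical, and the band 1 coefficients are assumed to be stored in the same order as the normal components):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// shLuma.x = band 0 luminance coefficient, shLuma.yzw = band 1 coefficients<br />// the constants are the usual band 0 / band 1 irradiance convolution weights<br />float luma= max(0.0, shLuma.x * 0.886227 + dot(shLuma.yzw, normal) * 1.023328);<br />// recover the color by modulating with the average chroma stored alongside<br />float3 indirectDiffuse= luma * avgChroma;</span></blockquote>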
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIgcZDpZhxhyphenhyphenC95hn91mArETjN3TAn-5Vloo649LHYBo_vxyzTneb9elLdWNgcddOFNXJggcCR3s05ZPjpqHQ8xg8lOfU3bvvrjNLmvekwj4-wfg5N9TGPXFc7wungedWSpfgdStzoS8_c/s1600/GI_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIgcZDpZhxhyphenhyphenC95hn91mArETjN3TAn-5Vloo649LHYBo_vxyzTneb9elLdWNgcddOFNXJggcCR3s05ZPjpqHQ8xg8lOfU3bvvrjNLmvekwj4-wfg5N9TGPXFc7wungedWSpfgdStzoS8_c/s1600/GI_001.png" height="102" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Direct + Indirect lighting</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghpN5VR8fc6PALjjdynq_ZjMdNRgld8ieogziAEtk64h0Jp01f7NV1Br0f_JmPXiKAilkAZnN08heq6-U3l89d2lYJLZI2d-X8pqgwGSbK6Ejmg07TvwgMqZD6Ci6LVPQ0RcBp8wvk4SMq/s1600/GI_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghpN5VR8fc6PALjjdynq_ZjMdNRgld8ieogziAEtk64h0Jp01f7NV1Br0f_JmPXiKAilkAZnN08heq6-U3l89d2lYJLZI2d-X8pqgwGSbK6Ejmg07TvwgMqZD6Ci6LVPQ0RcBp8wvk4SMq/s1600/GI_002.png" height="102" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Direct lighting only</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjof7XabV321bqB3HeTvAklL_2VwbeLVtIjgJn_x3Oi2s0XRj3j0jqzpm3jH6Ha2_ahQMO5nwdEOWFr6g4ORFlz0pwEMcYBp6oRk2jg2_1Bu7PdhQ4dSroNEuweJel1RAT-HN8FLirsWMZ4/s1600/GI_003.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjof7XabV321bqB3HeTvAklL_2VwbeLVtIjgJn_x3Oi2s0XRj3j0jqzpm3jH6Ha2_ahQMO5nwdEOWFr6g4ORFlz0pwEMcYBp6oRk2jg2_1Bu7PdhQ4dSroNEuweJel1RAT-HN8FLirsWMZ4/s1600/GI_003.png" height="102" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lighting only</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
The indirect specular lighting is baked by capturing reflection probes and pre-filtering them with the GGX model according to the <a href="http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf">Unreal Engine 4 paper[1]</a>. Currently only the closest reflection probe is used, without parallax correction.<br />
<br />
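At runtime the probe could be looked up in the usual split-sum fashion described in [1]; a minimal sketch (where <i>numMipLevels</i>, <i>envBRDFTex</i> and the sampler names are hypothetical):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// pick the mip level that was pre-filtered with the matching GGX roughness<br />float3 r= reflect(-viewDir, normal);<br />float mip= roughness * (numMipLevels - 1.0);<br />float3 prefiltered= probeCube.SampleLevel(samplerTrilinearClamp, r, mip).rgb;<br />// envBRDF holds the pre-integrated scale/bias of the specular color[1]<br />float2 envBRDF= envBRDFTex.Sample(samplerLinearClamp, float2(NdotV, roughness)).rg;<br />float3 indirectSpecular= prefiltered * (specColor * envBRDF.x + envBRDF.y);</span></blockquote>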
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigyZi-2FQy1agzQp0wnh58R3jVu0NEdcwT5EvDOgZR-B014xKBl_BT8qCCGPJtNwJnLbDQ1FR06mmvllj-EwQ2TRw5ywlJ3rEhToUR1NZUFgnPkL4Z4dpdo_2frbUPru_VtoPfaH41owDf/s1600/reflectionProbe_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigyZi-2FQy1agzQp0wnh58R3jVu0NEdcwT5EvDOgZR-B014xKBl_BT8qCCGPJtNwJnLbDQ1FR06mmvllj-EwQ2TRw5ywlJ3rEhToUR1NZUFgnPkL4Z4dpdo_2frbUPru_VtoPfaH41owDf/s1600/reflectionProbe_001.png" height="164" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Reflection probes placed within the scene</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij4B8SWQF9R-VlbMADC2ykM4gDsZVl0F5wE930E3Gt7ii6Dj_dTHxlmbTZ_6GD6sovi20vpISTLYZQQX8zGCEvg2BwJF0yPFx3n63Ikx18RTnGQc4TXhGeSYMjEpoUm8ykVkkxiUlmJKXu/s1600/reflectionProbe_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij4B8SWQF9R-VlbMADC2ykM4gDsZVl0F5wE930E3Gt7ii6Dj_dTHxlmbTZ_6GD6sovi20vpISTLYZQQX8zGCEvg2BwJF0yPFx3n63Ikx18RTnGQc4TXhGeSYMjEpoUm8ykVkkxiUlmJKXu/s1600/reflectionProbe_002.png" height="166" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Shading with pre-filtered cube-map</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
The indirect diffuse lighting for dynamic meshes is baked into <a href="http://developer.amd.com/wordpress/media/2012/10/Tatarchuk_Irradiance_Volumes.pdf">Irradiance Volumes[4]</a>, stored as SH coefficients. The SH coefficients are modified according to roughness as described in the <a href="http://research.tri-ace.com/Data/s2012_beyond_CourseNotes.pdf">tri-Ace paper[2]</a> (this is also done for the SH luma stored in the light map for static meshes).<br />
<br />
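A hedged sketch of what the runtime evaluation might look like (assuming 4 SH coefficients per color channel in the hypothetical <i>shR</i>/<i>shG</i>/<i>shB</i>, with <i>band1Scale</i> being the roughness-dependent attenuation factor derived from the tri-Ace paper[2]):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// band 0 + band 1 SH irradiance evaluation along the interpolated normal n<br />float4 basis= float4(0.886227, 1.023328 * n);<br />// attenuate band 1 for rough surfaces so the SH lighting gets "blurred"<br />basis.yzw= basis.yzw * band1Scale;<br />float3 irradiance= max(0.0, float3(dot(shR, basis), dot(shG, basis), dot(shB, basis)));</span></blockquote>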
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHFIuuHKYy_PtQbB5_j-OPKz4_VpWe84oV5QJW2vyz2W8epQn8can2Be4_U3L07sX479zAlOCsX19AOEPh_iSvJ-8Pk765gSdSv6NR2r8d5WJplwXsqWsTQSnxbBbH5RR8fIlcZ22gSnz2/s1600/lightProbe_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHFIuuHKYy_PtQbB5_j-OPKz4_VpWe84oV5QJW2vyz2W8epQn8can2Be4_U3L07sX479zAlOCsX19AOEPh_iSvJ-8Pk765gSdSv6NR2r8d5WJplwXsqWsTQSnxbBbH5RR8fIlcZ22gSnz2/s1600/lightProbe_002.png" height="172" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Irradiance Volume samples</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2tXXeWDsZO9SwFXeM8_0kHmtxyEolBybuA-0cH6kv4CAsoZIGV2q6k8UDO3WGxpgdt3VgVycJM8b_GyP-0u3WHgADComhEtSulkADz-BBn7M8ODMdYgXJWojNoNhLF74NlSSSQlqB00fg/s1600/lightProbe_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2tXXeWDsZO9SwFXeM8_0kHmtxyEolBybuA-0cH6kv4CAsoZIGV2q6k8UDO3WGxpgdt3VgVycJM8b_GyP-0u3WHgADComhEtSulkADz-BBn7M8ODMdYgXJWojNoNhLF74NlSSSQlqB00fg/s1600/lightProbe_001.png" height="164" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dragon shaded with Irradiance Volume</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span class="Apple-style-span" style="font-size: large;">Shadow</span></b><br />
The shadow system is updated to use a mix of baked and real-time shadows. The light map baker described above calculates a shadow term for the main directional light at each light map texel, stored using a signed distance field representation<a href="http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">[5]</a><a href="http://udn.epicgames.com/Three/DistanceFieldShadows.html">[6]</a> (also, an additional binary shadow term is calculated at each Irradiance Volume sample position for dynamic objects).<br />
<br />
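Runtime reconstruction of the baked term is cheap; below is a hedged sketch in the spirit of [5][6] (the 0.5 iso-value convention and the <i>penumbraWidth</i> parameter are assumptions, not my exact implementation):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// the light map channel stores a signed distance to the shadow boundary,<br />// remapped so that 0.5 lies exactly on the boundary<br />float sd= shadowDistanceField.Sample(samplerLinearClamp, lightMapUV).r;<br />// a smooth transition around the iso-value gives an anti-aliased, adjustable penumbra<br />float bakedShadow= smoothstep(0.5 - penumbraWidth, 0.5 + penumbraWidth, sd);</span></blockquote>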
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbMesdRLUGJg_3feYDjbIWWX1KfeX3AyoNV__K3RXAJ2ECx8FAgJJgAF2mxNJyo-Yy6tcaPhHwSJ9D8dW0oZQRQOVTKNyWOwOXhFULB5X8addE9s9hjdf6B9PXCRy94TPBLK3H7erbxVT1/s1600/shadow_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbMesdRLUGJg_3feYDjbIWWX1KfeX3AyoNV__K3RXAJ2ECx8FAgJJgAF2mxNJyo-Yy6tcaPhHwSJ9D8dW0oZQRQOVTKNyWOwOXhFULB5X8addE9s9hjdf6B9PXCRy94TPBLK3H7erbxVT1/s1600/shadow_001.png" height="127" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Real-time shadow only</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_K3r0aTXbakjaHKWN9M3JA_9vv1sco6hkK65LBDm08xUBQQLWV3HXaG0iUueLfYGfyuixhsTtNakmhC0zbZY3gSska04xkEqWKgkrPlQudcV_Udm2Jj1y9eHSLFOwMUw14W9tcxOn7OcJ/s1600/shadow_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_K3r0aTXbakjaHKWN9M3JA_9vv1sco6hkK65LBDm08xUBQQLWV3HXaG0iUueLfYGfyuixhsTtNakmhC0zbZY3gSska04xkEqWKgkrPlQudcV_Udm2Jj1y9eHSLFOwMUw14W9tcxOn7OcJ/s1600/shadow_002.png" height="127" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">baked shadow only</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBNeXmy5LDLe3clj7uqQYto4PnlpZdOEOVMKrLYyNcF72iSitt2Opbjq9NkOfXzKP98dRP1AAkzvmnFG2zr-r_mkCkhf7Y0SpCcbIh5yAzjYiCd9IQqT303BWFiDVd2MrzO9hDqM_HFLwQ/s1600/shadow_003.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBNeXmy5LDLe3clj7uqQYto4PnlpZdOEOVMKrLYyNcF72iSitt2Opbjq9NkOfXzKP98dRP1AAkzvmnFG2zr-r_mkCkhf7Y0SpCcbIh5yAzjYiCd9IQqT303BWFiDVd2MrzO9hDqM_HFLwQ/s1600/shadow_003.png" height="127" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Baked + real-time shadow</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
For the dynamic soft shadow, an <a href="http://research.edm.uhasselt.be/tmertens/papers/gi_08_esm.pdf">Exponential Shadow Map (ESM)[7]</a> is used. However, the contact shadows look too soft, so the shadow term calculated by ESM is raised to the power of 4 (i.e. <i><b>s'</b></i>= <i><b>s</b></i><sup>4</sup>, where <i>s</i> is the shadow attenuation result from ESM) to make them darker.<br />
<br />
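As a sketch (assuming the usual ESM formulation where the shadow map stores exp(<i>c</i>·depth) so the depth test turns into a multiplication[7]; variable names are hypothetical):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// ESM: the pre-filtered shadow map stores exp(esmConstant * occluderDepth)<br />float expOccluder= esmShadowMap.Sample(samplerLinearClamp, shadowUV).r;<br />float s= saturate(expOccluder * exp(-esmConstant * receiverDepth));<br />// darken the over-soft contact shadow as described above: s' = s^4<br />s= s * s * s * s;</span></blockquote>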
<b><span class="Apple-style-span" style="font-size: large;">Potential Visibility Set</span></b><br />
A potential visibility set (PVS) baker was written to calculate which meshes are visible from each visibility cell, used for culling the scene at runtime. A brute-force approach is used for baking: sample positions are taken on a given mesh (e.g. vertex positions + light map texel positions), and the scene is rendered from those positions to check whether the visibility cells are visible (and hence whether the mesh is potentially visible from those cells).<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoqGXLxiQpeEts5fVtTKPzwzFp3GAoYD_mpEGyotld-9s4HMasxkeWtGi7eIsqIBmimmGnytnkPYTxnDI9LylGNVV4GbHfvgLO069LaHSrS8H14B01hTs7_2zUupbtmXLW-Kzptt_6sAvG/s1600/pvs_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoqGXLxiQpeEts5fVtTKPzwzFp3GAoYD_mpEGyotld-9s4HMasxkeWtGi7eIsqIBmimmGnytnkPYTxnDI9LylGNVV4GbHfvgLO069LaHSrS8H14B01hTs7_2zUupbtmXLW-Kzptt_6sAvG/s1600/pvs_002.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Render without PVS culling</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbzm8p0ORy-U3ypZESqJ5aBZInB23-ZGyPpAtek8jQsdPTMcMhfSCJ0tE-1aR1_uA9YcM-U4bb3TchUGZt-wV4iHI95C14omOUaB1Knnffz15LxGnMjWcUyFh2qUKGMQ2DHpf5gtYXDtuc/s1600/pvs_003.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbzm8p0ORy-U3ypZESqJ5aBZInB23-ZGyPpAtek8jQsdPTMcMhfSCJ0tE-1aR1_uA9YcM-U4bb3TchUGZt-wV4iHI95C14omOUaB1Knnffz15LxGnMjWcUyFh2qUKGMQ2DHpf5gtYXDtuc/s1600/pvs_003.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Render with PVS culling</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMk8ouoMSeKVklce5DdsTj0BOyzq4DzcxYsOWsenY_HracS7JXWKmcQVjwJYmMKSoJesor9GLEzq0s1dMH0usTxB1-XPVU8Bb7CbZbfNVQMA1g16JMBrOKUdHdjPNrJUKioHhVmXv1odB5/s1600/pvs_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMk8ouoMSeKVklce5DdsTj0BOyzq4DzcxYsOWsenY_HracS7JXWKmcQVjwJYmMKSoJesor9GLEzq0s1dMH0usTxB1-XPVU8Bb7CbZbfNVQMA1g16JMBrOKUdHdjPNrJUKioHhVmXv1odB5/s1600/pvs_001.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Final rendered result</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-8nk0kIcNcNVMu0UvDG9kSVxw7ikXao8pTvGxnt0zjCNzu3FEeD-SE49Tm5wor6gm2nRo8e6HTStJTlKZFVffD9w5LXMi9GSYkfXC4F3f__LTMekT0ryb3tAZdsthq7Ao5YrCTg-UkOj/s1600/pvs_004.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-8nk0kIcNcNVMu0UvDG9kSVxw7ikXao8pTvGxnt0zjCNzu3FEeD-SE49Tm5wor6gm2nRo8e6HTStJTlKZFVffD9w5LXMi9GSYkfXC4F3f__LTMekT0ryb3tAZdsthq7Ao5YrCTg-UkOj/s1600/pvs_004.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Another view for the same camera render without PVS culling</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfg23yogpKJNjD7abYVcPiOErBo1qgDY5a_UAL0pm4b3AdarTwOvvuaIyMA-XjT4N2Z-AGr8S3EJSk-Hz5pi2v_F00GRJOKBluFLwv6CfhHCd2Sb15aVBnPrda550H65lHOvrFDE_3Ji23/s1600/pvs_005.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfg23yogpKJNjD7abYVcPiOErBo1qgDY5a_UAL0pm4b3AdarTwOvvuaIyMA-XjT4N2Z-AGr8S3EJSk-Hz5pi2v_F00GRJOKBluFLwv6CfhHCd2Sb15aVBnPrda550H65lHOvrFDE_3Ji23/s1600/pvs_005.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Another view for the same camera render with PVS culling</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqbmrbSHu-tooz0h1AS4Itk0klcjTcb7Lt3rccq4GwnmswK4hPKLwCmCK235c-aaoY4WfNBBo1TM2J36BGeLMcKRqSeXMJn7xlCpR41DK9VpDxSqEPwgJgO4MtD7Vx-Bva5EGOJ2w7nQRT/s1600/pvs_006.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqbmrbSHu-tooz0h1AS4Itk0klcjTcb7Lt3rccq4GwnmswK4hPKLwCmCK235c-aaoY4WfNBBo1TM2J36BGeLMcKRqSeXMJn7xlCpR41DK9VpDxSqEPwgJgO4MtD7Vx-Bva5EGOJ2w7nQRT/s1600/pvs_006.png" height="112" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Visibility Cells are placed within the scene for possible camera location</td></tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<br />
<b><span class="Apple-style-span" style="font-size: large;">Particles</span></b><br />
A basic particle system is implemented which can receive both static lighting and static shadows. The particles are shaded on the CPU using the Irradiance Volumes described above, and the lighting can be calculated either per vertex, per particle or per emitter.<br />
<br />
<table>
<tbody>
<tr>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzhuQLNXscKTCzVhzdNO11iSNOjkvYQs2Y3dYz7uiI3rJbNCEKxgUxD_qpFrY9RuFDmi-UY5wYYBGxQ2uHxEFMquc2JxMVuHwuaeP-JFd_mPvZIFkJECvwaWARh8bHeHYxhNHdBmQ0CbtB/s1600/particle_001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzhuQLNXscKTCzVhzdNO11iSNOjkvYQs2Y3dYz7uiI3rJbNCEKxgUxD_qpFrY9RuFDmi-UY5wYYBGxQ2uHxEFMquc2JxMVuHwuaeP-JFd_mPvZIFkJECvwaWARh8bHeHYxhNHdBmQ0CbtB/s1600/particle_001.png" height="164" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Particles with lighting and shadow enabled</td></tr>
</tbody></table>
</td>
<td><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNZQKGEjCmiBDfH1mcq3eXiMHgjWMRVJJUMCqUcqqLACiE2vW_Z5a4670QKGJw2-m1K-XPr89yW4NHGFKmvJQidUseNtyQ8HSmO3o3T0z_0nchwRmdhARSbeKXCtFiaP6pX9LmanDEHaNk/s1600/particle_002.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNZQKGEjCmiBDfH1mcq3eXiMHgjWMRVJJUMCqUcqqLACiE2vW_Z5a4670QKGJw2-m1K-XPr89yW4NHGFKmvJQidUseNtyQ8HSmO3o3T0z_0nchwRmdhARSbeKXCtFiaP6pX9LmanDEHaNk/s1600/particle_002.png" height="164" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Self-shadow disabled</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
The particles can also receive self-shadowing using <a href="http://sebastien.hillaire.free.fr/index.php?option=com_content&view=article&id=61&Itemid=72">Fourier Opacity Mapping[8]</a>. An opacity map is computed on the CPU side for the main directional light, assuming each particle is sphere-shaped; a shadow term can then be computed from it during shading.<br />
<br />
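For reference, a hedged sketch of the transmittance reconstruction with a single Fourier term, following the formulation in [8] (<i>d</i> is the depth of the shaded particle normalized to [0, 1] along the light ray, and <i>a0</i>/<i>a1</i>/<i>b1</i> are the accumulated Fourier coefficients from the opacity map; not my exact implementation):<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-size: xx-small;">// optical depth reconstructed from the truncated Fourier series of the opacity<br />float TwoPi= 6.28318531;<br />float tau= a0 * d * 0.5 + (a1 / TwoPi) * sin(TwoPi * d) + (b1 / TwoPi) * (1.0 - cos(TwoPi * d));<br />// the self-shadow term is the transmittance up to the particle being shaded<br />float shadowTerm= exp(-tau);</span></blockquote>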
<b><span class="Apple-style-span" style="font-size: large;">Cross platform support</span></b><br />
The engine runtime supports 3 different platforms: Windows, Mac and iOS (the editors are Windows only). On Windows, the engine mainly runs on D3D11. An extra OpenGL wrapper was also written for Windows to ease the porting to iOS and Mac, because it is easier to debug OpenGL on Windows with tools like Nsight.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsQFjk-gLfnfnXW5MVYbd4ERtIDAWf0gPKM3YtvzELpKfW9RfdTZ_WSbNrvGSlOpzElJhTozBw_v6lkYgFGw8-LsN4cnCAZdvGjIgoUStdsbeSeYmV1x1I6eCyi5cqnycQn6sKp2Gofhxc/s1600/cross_platform.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsQFjk-gLfnfnXW5MVYbd4ERtIDAWf0gPKM3YtvzELpKfW9RfdTZ_WSbNrvGSlOpzElJhTozBw_v6lkYgFGw8-LsN4cnCAZdvGjIgoUStdsbeSeYmV1x1I6eCyi5cqnycQn6sKp2Gofhxc/s1600/cross_platform.jpg" height="251" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The engine runs on Windows, Mac, iOS platforms</td></tr>
</tbody></table>
<br />
To write cross-platform shaders, the shaders are written in my own shading language, which is similar to HLSL. They are then parsed by a shader parser generated with <a href="http://aquamentus.com/flex_bison.html">Flex and Bison[9]</a> into a syntax tree, which is used to output the actual HLSL and GLSL source code.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Final words</span></b><br />
I hope you enjoy the above screenshots, and I hope to find some time to describe these features in detail in future posts.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvW955EJYVP99W9ul771DXNlt9G3x3eM5fT9D3CK9AqgSM_5cG842KowAMGjZaTBICwbD8kFK27KivalgCA6FE0MyxALqiqavBPn9-Hj4xZh4oJEaicIrUHtArMhN30C-dw9T11rCygtmf/s1600/editor.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvW955EJYVP99W9ul771DXNlt9G3x3eM5fT9D3CK9AqgSM_5cG842KowAMGjZaTBICwbD8kFK27KivalgCA6FE0MyxALqiqavBPn9-Hj4xZh4oJEaicIrUHtArMhN30C-dw9T11rCygtmf/s1600/editor.png" height="186" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">There are some other tasks implemented in the past few months,<br />
such as various editors, audio coding as well as asset hot-reload</td></tr>
</tbody></table>
<b>Reference</b><br />
<span style="font-size: x-small;">[1] Real Shading in Unreal Engine 4 <a href="http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf">http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf</a></span><br />
<span style="font-size: x-small;">[2] Beyond a Simple Physically Based Blinn-Phong Model in Real-time <a href="http://research.tri-ace.com/Data/s2012_beyond_CourseNotes.pdf">http://research.tri-ace.com/Data/s2012_beyond_CourseNotes.pdf</a></span><br />
<span style="font-size: x-small;">[3] Radiosity <a href="http://en.wikipedia.org/wiki/Radiosity_(computer_graphics)">http://en.wikipedia.org/wiki/Radiosity_(computer_graphics)</a></span><br />
<span style="font-size: x-small;">[4] Irradiance Volumes for Games <a href="http://developer.amd.com/wordpress/media/2012/10/Tatarchuk_Irradiance_Volumes.pdf">http://developer.amd.com/wordpress/media/2012/10/Tatarchuk_Irradiance_Volumes.pdf</a></span><br />
<span style="font-size: x-small;">[5] Improved Alpha-Tested Magnification for Vector Textures and Special Effects <a href="http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf</a></span><br />
<span style="font-size: x-small;">[6] Distance Field Shadows <a href="http://udn.epicgames.com/Three/DistanceFieldShadows.html">http://udn.epicgames.com/Three/DistanceFieldShadows.html</a></span><br />
<span style="font-size: x-small;">[7] Exponential Shadow Maps <a href="http://research.edm.uhasselt.be/tmertens/papers/gi_08_esm.pdf">http://research.edm.uhasselt.be/tmertens/papers/gi_08_esm.pdf</a></span><br />
<span style="font-size: x-small;">[8] Fourier Opacity Mapping <a href="http://sebastien.hillaire.free.fr/index.php?option=com_content&view=article&id=61&Itemid=72">http://sebastien.hillaire.free.fr/index.php?option=com_content&view=article&id=61&Itemid=72</a></span><br />
<span style="font-size: x-small;">[9] Flex and Bison <a href="http://aquamentus.com/flex_bison.html">http://aquamentus.com/flex_bison.html</a></span><br />
<br />
Simonhttp://www.blogger.com/profile/16505698282735255970noreply@blogger.com5