Testing Hot-Reload DLL on Windows

After finishing the game Seal Guardian and taking some rest, I recently went back to refactoring its engine code. The engine can hot-reload every asset type, from textures, shaders and game levels to Lua scripts, but it cannot hot-reload C/C++ code. So I decided to spend some time looking for resources about hot-reloading C/C++. It turns out that hot-reloading C/C++ is not trivial on Windows because the PDB file is locked by the debugger. The approach of patching the PDB path inside the DLL looked interesting, so I gave it a try, and the sample program is uploaded here (only tested with Visual Studio Community 2017).

First try
Because the PDB file path is hard-coded inside the DLL file, the approach used by cr.h [5] is to parse the DLL file according to the Portable Executable format, find the PDB file path and replace it with a new file path.

So I tried something similar. Unlike cr.h, instead of generating a new DLL/PDB file name every time the DLL gets re-compiled, I use a fixed temporary name (I don't want to have many random files inside the binary directory after several hot reloads...). For example, when Visual Studio generates the files:
  • abc.dll
  • abc.pdb
and the sample program detects that abc.dll is updated, it will generate 2 new files:
  • ab_.dll
  • ab_.pdb
where ab_.dll has a patched PDB path pointing to the newly copied ab_.pdb. The program then loads ab_.dll instead.

The reason I don't choose a more meaningful name like abc_tmp.dll is that I worry a file name longer than the original may mess up the offset values stored inside the DLL. So I just replace the last character with an underscore.
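The same-length replacement can be sketched as below. This is a simplified illustration, not the code of the sample program: it searches the raw bytes for the old path instead of walking the Portable Executable debug directory, and the function names patchPdbPath and makeTempName are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Overwrites the first occurrence of oldPath inside the DLL image with newPath.
// Both paths must have the same length so that no byte after the path moves.
// Returns false when oldPath is not found in the image.
bool patchPdbPath(std::vector<char>& image,
                  const std::string& oldPath,
                  const std::string& newPath)
{
    assert(!oldPath.empty() && oldPath.size() == newPath.size());

    auto it = std::search(image.begin(), image.end(),
                          oldPath.begin(), oldPath.end());
    if (it == image.end())
        return false;

    std::copy(newPath.begin(), newPath.end(), it);
    return true;
}

// Builds the temporary name by replacing the last character before the
// extension with an underscore, e.g. "abc.pdb" -> "ab_.pdb".
std::string makeTempName(const std::string& fileName)
{
    const std::size_t dot = fileName.find_last_of('.');
    assert(dot != std::string::npos && dot > 0);
    std::string result = fileName;
    result[dot - 1] = '_';
    return result;
}
```

Because the new path has exactly the same number of bytes as the old one, the file size and every offset after the path stay unchanged.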

This approach works: every time I start the program without the debugger by pressing Ctrl+F5 in Visual Studio, then edit some code and rebuild the solution by pressing F7, the DLL gets hot-reloaded. When the sample program exits, the ab_.dll and ab_.pdb files get deleted.

However, when the program quits with a debugger attached, it can't delete the ab_.pdb file...

Second try
We know that the Visual Studio debugger is locking the PDB file. So what if, when we detect that a debugger is attached, we detach it programmatically before the program exits? Luckily the EnvDTE COM library can help with this task, and someone has written sample code to do this [8]. (That sample code says the "VisualStudio.DTE" string needs to be changed to match the installed version, e.g. "VisualStudio.DTE.14.0", but I have tested it with Visual Studio Community 2017 and it works without modification.) So, by detaching the debugger programmatically, we can delete the temporary PDB file when the program exits.

Third try
Now that we can detach the debugger programmatically, why not try re-attaching the debugger after every hot reload? With the re-attach code written, I tried running the program by pressing F5 (Start Debugging) and then pressing F7 to re-compile the solution. A "Do you want to stop debugging?" dialog popped up.

I happily pressed 'Yes' and hoped the hot reload would work. Do you know what happened? The debugger stopped, but the application also quit... It looks like this approach only works when using Ctrl+F5 (Start Without Debugging)... I searched the web for a way to stop Visual Studio from killing the app when the debugger stops, but I could only find people suggesting to detach the debugger instead. So I work around this problem by detaching the debugger and re-attaching it during program start, so that the debugger does not kill the app when it stops.

So the hot-reload function is almost working now: just press F5 to start, and F7 then Enter to re-compile. But sometimes the debugger fails to re-attach to the reloaded app. After spending some time investigating, I found that the EnvDTE::Process::Item() function may fail to find the reloaded app process and return the error code RPC_E_CALL_REJECTED. I don't know why this happens; maybe the process is busy reloading the new DLL. So the final workaround is to wait a bit to let the process finish its work, and retry several times.
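The wait-and-retry workaround can be sketched with a small generic helper. The name retryWithDelay and its parameters are hypothetical; in the real program the callback would wrap the EnvDTE re-attach call and return false when it reports RPC_E_CALL_REJECTED.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls attempt() until it returns true or maxAttempts is reached,
// sleeping for `delay` between failed attempts.
// Returns the number of attempts used, or 0 if every attempt failed.
int retryWithDelay(const std::function<bool()>& attempt,
                   int maxAttempts,
                   std::chrono::milliseconds delay)
{
    assert(maxAttempts > 0);
    for (int i = 1; i <= maxAttempts; ++i)
    {
        if (attempt())
            return i;
        if (i < maxAttempts)
            std::this_thread::sleep_for(delay);
    }
    return 0;
}
```

The attempt count and the delay are tuning values; a few attempts spaced some hundred milliseconds apart is only a guess at a reasonable starting point.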

Fourth try
We know that detaching the debugger unlocks the PDB. So what if we just detach the debugger to unlock the PDB, and only copy the newly compiled DLL without patching in a new PDB path? Unfortunately, it fails with a message saying that the .vcxproj file is locked...

So I can only go back to the "Third try" approach...

Last try
We finally have a workable approach to reload the DLL, so how about the executable itself? I tried the "Edit and Continue" function in Visual Studio, and it works! But only once... This is because after Edit and Continue, stopping the debugger makes Visual Studio kill the app... And manually detaching the debugger from Visual Studio fails with an error.

So the "Edit and Continue" function is not compatible with my hot-reload method, which relies on detaching the debugger...

In this post, I have described the methods I tried when writing hot-reloadable DLL code on Windows. The steps are as follows:

When the program loads a DLL:
1. Copy its associated PDB file.
2. Copy the target DLL file and change its hard-coded PDB path to the path of the PDB copied in step 1.
3. Load the copied DLL in step 2 instead.
After editing some code:
4. Detach the debugger to compile the DLL from Visual Studio.
5. Unload the copied DLL.
6. Repeat steps 1 to 3 above.
7. Re-attach the debugger.
From a programmer's perspective, the steps are:
1. In Visual Studio, press F5 to compile and run the program with debugger.
2. Edit some code, then press F7 to re-build the solution.
3. Press enter to confirm the "Do you want to stop debugging?" dialog.
4. The program will reload the new DLL and re-attach the debugger automatically after compilation.
You can try the above workflow by downloading the sample code. I have only tested it with Visual Studio Community 2017, and it may not work with other versions of Visual Studio. This method is far from perfect, so if anyone knows a better method that doesn't require workarounds, please let me know. Thank you very much!

[1] https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives
[2] https://ourmachinery.com/post/dll-hot-reloading-in-theory-and-practice/
[3] https://ourmachinery.com/post/little-machines-working-together-part-2/
[4] https://blog.molecular-matters.com/2017/05/09/deleting-pdb-files-locked-by-visual-studio/
[5] https://github.com/fungos/cr
[6] http://www.debuginfo.com/articles/debuginfomatch.html
[7] https://msdn.microsoft.com/en-us/library/ms809762.aspx
[8] https://handmade.network/forums/wip/t/1479-sample_code_to_programmatically_attach_visual_studio_to_a_process

Simple GPU Path Tracer

Path tracing has become more popular in recent years. Because the code is easy to run in parallel, running the path tracer on the GPU can greatly reduce the rendering time. This post is just my personal notes on learning the basics of path tracing and getting familiar with the D3D12 API. The source code can be downloaded here. And for those who don't want to compile from source, the executable can be downloaded here.

Rendering Equation
Like other rendering algorithms, path tracing solves the rendering equation:

To solve this integral, Monte Carlo Integration can be used, so we will shoot many rays within a single pixel from the camera position.

During path tracing, when a ray hits a surface, we accumulate its light emission as well as the light reflected by that surface, i.e. we compute the rendering equation. But we only take one sample in the Monte Carlo integration, so only 1 random ray is generated according to the surface normal, which simplifies the equation to:

Since we shoot many rays within a single pixel, we can still get an unbiased result. Expanding the recursive path tracing rendering equation, we can derive the following equation:
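For reference, the three equations discussed in this section can be written in standard notation as follows. The symbols are assumptions and may differ from the original figures: x is the hit point, \omega_o and \omega_i are the outgoing and incoming directions, f_r is the BRDF, n is the surface normal, p is the probability density of the sampled direction, and the subscript k counts the bounces along one path.

```latex
% Rendering equation
\[
L_o(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)\, d\omega_i
\]

% Monte Carlo estimate with a single sampled direction \omega_i
\[
L_o(x, \omega_o) \approx L_e(x, \omega_o)
  + \frac{f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)}{p(\omega_i)}
\]

% Recursion expanded along one path x_0, x_1, x_2, ...
\[
L \approx L_{e,0} + \beta_0 \left( L_{e,1} + \beta_1 \left( L_{e,2} + \dots \right) \right),
\qquad
\beta_k = \frac{f_{r,k}\, (n_k \cdot \omega_k)}{p(\omega_k)}
\]
```

In the expanded form each bounce only needs a running product of the \beta_k terms, which is why the recursion can be written as a simple loop in a shader.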

GPU random number
To compute the Monte Carlo integration, we need to generate random numbers on the GPU. The wang_hash function [4] is used because of its simple implementation.
  uint wang_hash(uint seed)
  {
      seed = (seed ^ 61) ^ (seed >> 16);
      seed *= 9;
      seed = seed ^ (seed >> 4);
      seed *= 0x27d4eb2d;
      seed = seed ^ (seed >> 15);
      return seed;
  }
We use the pixel index as the input for the wang_hash function.
seed = px_pos.y * viewportSize.x + px_pos.x
However, there are some visible patterns in the random noise texture generated with this method (although they do not affect the final render result much...):

Luckily, to fix this, we can simply multiply the pixel index by an arbitrary number, which eliminates the visible pattern in the random texture.
seed = (px_pos.y * viewportSize.x + px_pos.x) * 100 

To generate multiple random numbers within the same pixel, we can increment the random seed by a constant after each call to the wang_hash function. Any constant larger than 0 (e.g. 10) is good enough for this simple path tracer.
  float rand(inout uint seed)
  {
      float r = wang_hash(seed) * (1.0 / 4294967296.0);
      seed += 10;
      return r;
  }
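For testing the generator on the CPU, here is a C++ port as a rough sketch. The names pixel_seed and rand01 are hypothetical. It deviates from the shader in one point: only the top 24 bits of the hash are converted to float, because converting a full 32-bit value to a 32-bit float can in rare cases round up and produce exactly 1.0.

```cpp
#include <cassert>
#include <cstdint>

// Same hash as the shader version, using a fixed-width 32-bit type.
std::uint32_t wang_hash(std::uint32_t seed)
{
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

// Initial per-pixel seed: the pixel index multiplied by a constant.
std::uint32_t pixel_seed(std::uint32_t x, std::uint32_t y,
                         std::uint32_t viewportWidth)
{
    assert(x < viewportWidth);
    return (y * viewportWidth + x) * 100u;
}

// Returns a value in [0, 1) and advances the seed by a constant.
// Only the top 24 bits are used, so the value is exactly representable
// as a float and never reaches 1.0.
float rand01(std::uint32_t& seed)
{
    const float r = static_cast<float>(wang_hash(seed) >> 8)
                  * (1.0f / 16777216.0f);
    seed += 10u;
    return r;
}
```

Each call to rand01 with the same seed variable yields the next number of that pixel's sequence.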
Scene Storage
To trace rays on the GPU, I upload all the scene data (e.g. triangles, materials, lights...) into several structured buffers and a constant buffer. Due to my laziness and the announcement of DirectX Raytracing, I did not implement any ray tracing acceleration structure like a BVH. I just store the triangles in one big buffer.

Tracing Rays
Using the rendering equation derived above, we can start writing code to shoot rays from the camera. In each frame, for each pixel, we trace one ray and reflect it multiple times to compute the rendering equation. We can then additively blend the path traced results over multiple frames to get a progressive path tracer, using the following blend factor:
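A common choice for such a blend factor is 1 / (number of frames accumulated so far), which turns the blend into a running average. The sketch below assumes that choice; the exact factor used by the sample may differ, and blendFrame is a hypothetical name.

```cpp
#include <cassert>

// Running average over frames: after blending the sample of frame
// `frameIndex` (0-based), the result is the mean of all samples so far.
float blendFrame(float accumulated, float newSample, int frameIndex)
{
    assert(frameIndex >= 0);
    const float blendFactor = 1.0f / static_cast<float>(frameIndex + 1);
    return accumulated + (newSample - accumulated) * blendFactor;
}
```

With this factor the first frame replaces the accumulated value completely, and the frame counter has to be reset whenever the camera or the scene changes.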

To generate the random reflected direction at any surface hit by a ray, we simply sample a direction uniformly on the hemisphere around the surface normal:

Here is the result of the path tracer when using the uniform random direction and using an emissive light material. The result is quite noisy:

Uniform implicit light sampling, 64 samples per pixel

To reduce noise, we can weight the randomly reflected ray with a cosine factor similar to the Lambert diffuse surface:
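The two sampling strategies can be sketched as below, using the usual textbook mappings from two random numbers to a direction. The directions are expressed in a local frame where +Z is the surface normal; rotating them into world space with a tangent frame is left out, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Uniform direction on the hemisphere around +Z, from two random numbers
// u1, u2 in [0, 1). The probability density is 1 / (2 * pi).
Vec3 sampleHemisphereUniform(float u1, float u2)
{
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float pi = 3.14159265f;
    const float z = u1;                       // cos(theta) is uniform in [0, 1)
    const float r = std::sqrt(1.0f - z * z);  // sin(theta)
    const float phi = 2.0f * pi * u2;
    return { r * std::cos(phi), r * std::sin(phi), z };
}

// Cosine-weighted direction around +Z. The probability density is
// cos(theta) / pi, which cancels the cosine term of a Lambert surface.
Vec3 sampleHemisphereCosine(float u1, float u2)
{
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float pi = 3.14159265f;
    const float r = std::sqrt(u1);            // radius on the unit disk
    const float phi = 2.0f * pi * u2;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u1) };
}
```

The cosine-weighted version sends more rays towards the normal, where a diffuse surface receives most of its light, so fewer samples are spent on directions that contribute little.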

Cos weighted implicit light sampling, 64 samples per pixel
The result is still a bit noisy. Because the light source in our scene is not very large, the probability that a randomly reflected ray hits the light source is quite low. To improve this, we can explicitly sample the light source for every ray that hits a surface.

To sample a rectangular light source, we can randomly choose a point over its surface area, and the corresponding probability will be:
1/area of light
Since our light sampling is over the area domain instead of the direction domain used in the above equation, the rendering equation needs to be multiplied by the Jacobian that relates solid angle to area, i.e.
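As a sketch of that change of variables, the function below computes the geometric weight of one light sample picked uniformly over the light area. The name lightSampleWeight and the vector helpers are hypothetical; the BRDF, the light radiance and the shadow ray test are assumed to be handled by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Weight of one light sample chosen uniformly over a light of area lightArea:
//   cos(theta_surface) * cos(theta_light) / distance^2 * lightArea
// cos(theta_light) / distance^2 is the area-to-solid-angle Jacobian, and
// multiplying by lightArea is the division by the pdf 1 / lightArea.
float lightSampleWeight(Vec3 surfacePos, Vec3 surfaceNormal,
                        Vec3 lightPos, Vec3 lightNormal, float lightArea)
{
    assert(lightArea > 0.0f);
    const Vec3 toLight = sub(lightPos, surfacePos);
    const float dist2 = dot(toLight, toLight);
    if (dist2 <= 0.0f)
        return 0.0f;

    const float invDist = 1.0f / std::sqrt(dist2);
    const Vec3 dir = { toLight.x * invDist, toLight.y * invDist, toLight.z * invDist };

    // Both cosines are clamped so that back-facing configurations give 0.
    const float cosSurface = std::max(0.0f, dot(surfaceNormal, dir));
    const float cosLight = std::max(0.0f, -dot(lightNormal, dir));
    return cosSurface * cosLight * lightArea / dist2;
}
```

Note the division by the squared distance: when the sampled light point is very close to the surface, this weight becomes very large.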

With the same number of samples per pixel, the result is much less noisy:

Uniform explicit light sampling, 64 samples per pixel
Cos weighted explicit light sampling, 64 samples per pixel

Simple de-noise

As we have seen above, the result of path tracing is a bit noisy even with 64 samples per pixel. The result is even worse for the first frame:

first frame path traced result
There are some very bright dots, and the image does not look good during camera motion. So I added a simple de-noise pass, which just blurs together pixels that are located on the same surface (it needs a lot of pixels to make the result look good, which costs some performance...).

Blurred first frame path traced result
To identify which surface a pixel corresponds to, we store this data in the alpha channel of the path tracing texture using the following formula:
dot(surface_normal, float3(1, 10, 100)) + (mesh_idx + 1) * 1000
This works because the scene contains only a small number of meshes and the normal is the same across each surface in this simple scene.
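The surface id and the comparison used by the blur can be sketched as follows. surfaceId follows the formula above; sameSurface and its tolerance value are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Packs the surface normal and the mesh index into a single float.
float surfaceId(Vec3 normal, int meshIdx)
{
    assert(meshIdx >= 0);
    return normal.x * 1.0f + normal.y * 10.0f + normal.z * 100.0f
         + static_cast<float>(meshIdx + 1) * 1000.0f;
}

// Two pixels are blurred together only when their ids match.
// The tolerance absorbs rounding and is an assumed value.
bool sameSurface(float idA, float idB, float tolerance = 0.5f)
{
    return std::fabs(idA - idB) < tolerance;
}
```

For example, a floor with normal (0, 1, 0) on mesh 0 gets the id 1010, while a wall with normal (1, 0, 0) on the same mesh gets 1001, so the two are never blurred together.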

Random Notes...
During the implementation, I encountered various bugs/artifacts which I think are interesting.

The first is about the simple de-noise pass. It may bleed the light source color into neighboring pixels far away, even though we have per-pixel mesh index data.

This is because we only store a single mesh index per pixel, but we jitter the ray shot from the camera within a single pixel every frame, so some of the light color gets blended into the pixels at the edge of the light geometry. It is very noticeable because the light source has a very high radiance compared to the light reflected by the ceiling geometry.

To fix this, I simply do not jitter the ray when tracing a direct hit of the light geometry from the camera, so this fix only applies to explicit light sampling.

The second one is about quantization when using a 16-bit floating point texture. The path tracing texture may sometimes show a quantized result after several hundred frames of additive blending when the single-sample-per-pixel path traced result is very noisy.

Quantized implicit light sampling
Path traced result in first frame
simple de-noised first frame result
To work around this, a 32-bit floating point texture needs to be used, but this may have a performance impact (especially for my simple de-noise pass...).

The last one is the bright fireflies artifact when using a very large light source (as big as the ceiling). This may sound counter-intuitive. And the implicit light path traced result (i.e. not sampling the light source directly) does not have those fireflies...

Explicit light sample result
Implicit light sample result
But it turns out this artifact is not related to the size of the light source, but to the light being too close to the reflecting geometry. To visualize it, we can look at how the light gets bounced:

path trace depth = 1
path trace depth = 2

The fireflies start to appear in the first bounce, at positions near the light source, and then get propagated with the reflected light rays. Those large values are generated by the Jacobian transform of explicit light sampling, specifically its denominator, which is the squared distance between the light and the surface.

After a brief search on the internet, the fixes are to implement radiance clamping or bi-directional path tracing, or to greatly increase the number of samples. Here is the result with over 75000 samples per pixel, but it still contains some fireflies...
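Radiance clamping is not implemented in this path tracer, but as a rough sketch of the idea, each sample can be limited before it is accumulated. The function name and the way the limit is applied are assumptions; other variants clamp by luminance or clamp only the indirect bounces.

```cpp
#include <algorithm>
#include <cassert>

struct Color { float r, g, b; };

// Scales a sample down so that its largest channel does not exceed maxValue.
// Scaling all channels together keeps the hue. This biases the result
// (energy is lost) in exchange for removing very bright outliers.
Color clampRadiance(Color c, float maxValue)
{
    assert(maxValue > 0.0f);
    const float largest = std::max(c.r, std::max(c.g, c.b));
    if (largest <= maxValue)
        return c;
    const float scale = maxValue / largest;
    return { c.r * scale, c.g * scale, c.b * scale };
}
```

The limit has to be tuned per scene: too low and bright highlights get darker, too high and the fireflies remain.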

In this post, we discussed the steps to implement a simple GPU path tracer. The most basic path tracer simply shoots a large number of rays per pixel and reflects each ray multiple times until it hits a light source. With explicit light sampling, we can greatly reduce noise.

This path tracer is just my personal toy project, which only has Lambert diffuse reflection with a single light. It is my first time using the D3D12 API and the code is not well optimized, so the source code is for reference only. If you find any bugs, please let me know. Thank you.

[1] Physically Based Rendering http://www.pbrt.org/
[2] https://www.slideshare.net/jeannekamikaze/introduction-to-path-tracing
[3] https://www.slideshare.net/takahiroharada/introduction-to-bidirectional-path-tracing-bdpt-implementation-using-opencl-cedec-2015
[4] http://reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/

Render Passes in "Seal Guardian"

"Seal Guardian" uses a forward renderer to render the scene. Because we need to support mobile platform, we don't have too many effect in it. But still it consists of a few render passes to compose an image.

Shadow Map Pass
To calculate the dynamic shadows of the scene, we need to render the depth of the meshes from the light's point of view. We render them into a 1024x1024 shadow map.
Standard shadow map

Then we use the Exponential Shadow Map method to blur the shadow map into a 512x512 shadow map.
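As a rough sketch of how an exponential shadow map is typically evaluated (not necessarily how the game does it), the occluder depth is stored as exp(c * depth), blurred, and then compared against the receiver depth with a single multiply. The function names and the constant c are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Value written to the shadow map for an occluder at a depth in [0, 1].
// Blurring these exponential values is what softens the shadow edge.
float esmEncode(float occluderDepth, float c)
{
    assert(c > 0.0f);
    return std::exp(c * occluderDepth);
}

// Light visibility in [0, 1] for a receiver at receiverDepth, given the
// blurred shadow map value. Equals exp(c * (occluder - receiver)), clamped.
float esmVisibility(float filtered, float receiverDepth, float c)
{
    assert(c > 0.0f);
    return std::min(1.0f, filtered * std::exp(-c * receiverDepth));
}
```

A larger c gives a sharper shadow edge but needs more precision in the shadow map, so the value is a trade-off that depends on the texture format.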
ESM blurred shadow map

(Note that this pass may be skipped according to current performance setting.)

Opaque Geometry Pass
In this pass, we render the scene meshes into an RGBA8 render target. We compute all the lighting, including direct lighting, indirect lighting (lightmap or SH probe) and tone mapping, in this single pass. This is because on iOS, reducing the number of render passes may give better performance, so we chose to combine all the calculations into a single pass.
Tonemapped opaque scene color
Opaque geometry depth buffer

To reduce the impact of overdraw, we pre-compute a visibility set to avoid drawing occluded meshes (I may talk about it in a future post). Also, because we want to add a bloom pass to enhance the effect of bright pixels, we compute a bloom value in this pass from the pre-tone-mapped value and store it in the alpha channel of the render target.

Transparent Geometry Pass
In this pass, we render transparent meshes and particles. We blend the post-tonemapped color with the opaque geometry for performance reasons. Also, because we store the bloom intensity in the alpha channel, and we want the transparent geometry to affect the bloom result, we solve this with 2 different methods depending on which platform the game runs on.

On iOS, we render the mesh directly to the render target of the opaque geometry pass with a shader similar to the opaque pass, outputting the tonemapped scene color in RGB and the bloom intensity in A. To blend those 4 values over the opaque values, we use the EXT_shader_framebuffer_fetch OpenGL extension. So the blending happens at the end of the transparent geometry shader, and we choose the simple blending formula below, which uses the opacity of the mesh (because we want to make it consistent with the other platforms):
RGB = mesh color * mesh alpha + dest color * (1 - mesh alpha)
A = mesh bloom intensity * mesh alpha + dest bloom intensity * (1 - mesh alpha)
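Written out as code, the blend looks like the sketch below. It is a CPU-side illustration with hypothetical names; in the shader, dst would be the value read through the framebuffer fetch extension.

```cpp
#include <cassert>

struct Rgba { float r, g, b, a; };

// src.rgb = tonemapped mesh color, src.a = mesh bloom intensity,
// dst     = value already in the render target,
// meshAlpha = opacity of the transparent mesh.
Rgba blendTransparent(Rgba src, Rgba dst, float meshAlpha)
{
    assert(meshAlpha >= 0.0f && meshAlpha <= 1.0f);
    const float inv = 1.0f - meshAlpha;
    return { src.r * meshAlpha + dst.r * inv,
             src.g * meshAlpha + dst.g * inv,
             src.b * meshAlpha + dst.b * inv,
             src.a * meshAlpha + dst.a * inv };
}
```

Fixed-function blending could not do this directly here, because the alpha output carries the bloom intensity while the blend weight is the mesh opacity, which are two different values.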
On Windows and Mac, the EXT_shader_framebuffer_fetch extension does not exist, so we render all the transparent meshes into a separate RGBA8 render target. We compute the scene color and bloom intensity similarly to the opaque pass, but before writing to the render target, we decompose the RGB scene color into luma and chroma and store the chroma value in a checkerboard pattern, similar to this paper [1] (slide 104). So we can store luma + chroma in the RG channels, the bloom intensity in the B channel and the opacity of the mesh in the A channel of the render target.
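One way to do such a luma/chroma split is the YCoCg color space, keeping one of the two chroma components per pixel. The sketch below assumes YCoCg and a 0.5 bias for storing chroma in an unsigned channel; the color space and layout actually used by the game may differ, and the names are hypothetical.

```cpp
#include <cassert>

struct Rgb { float r, g, b; };
struct LumaChroma { float luma; float chroma; };

// Converts to YCoCg and keeps luma plus one chroma component per pixel:
// Co where (x + y) is even, Cg where it is odd. Chroma is biased by 0.5
// so that its [-0.5, 0.5] range fits an unsigned 8-bit channel.
LumaChroma encodeCheckerboard(Rgb c, int x, int y)
{
    assert(x >= 0 && y >= 0);
    const float luma = 0.25f * c.r + 0.5f * c.g + 0.25f * c.b;
    const float co = 0.5f * c.r - 0.5f * c.b;
    const float cg = -0.25f * c.r + 0.5f * c.g - 0.25f * c.b;
    const bool even = ((x + y) & 1) == 0;
    return { luma, (even ? co : cg) + 0.5f };
}
```

When the texture is composed over the opaque target, the chroma component missing at a pixel has to be reconstructed from its neighbors, which is not shown here.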
Transparent render target on Windows platform

Finally, we can blend this transparent texture over the opaque geometry pass render target.
Composed opaque and transparent geometry

Post Process Pass
After those geometry passes, we can blend in the bloom filter. We make several blur passes over those bright pixels and additively blend the result over the previous render pass output to enhance the bright effect.
Blurred bright pixels
Additive blended bloom texture with scene color

Then we compute a simplified (but not very accurate, due to the lack of a velocity buffer) temporal anti-aliasing using the color and depth buffers of the current frame and the previous 2 frames. One thing we didn't mention is that, when rendering the opaque and transparent meshes, we jitter the camera projection by half a pixel, alternating between odd and even frames, similar to the figure below, so that we have sub-pixel information for anti-aliasing.
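A projection jitter of this kind can be sketched as an offset in normalized device coordinates that is added to the projection matrix. The offsets below (a quarter pixel in opposite directions on even and odd frames, so the two frames are half a pixel apart) are an assumption; the exact pattern of the game is the one shown in the figure.

```cpp
#include <cassert>

struct Jitter { float x, y; };

// Sub-pixel offset in normalized device coordinates for a given frame.
// Even frames are shifted by -0.25 pixel and odd frames by +0.25 pixel
// on both axes, so consecutive frames are half a pixel apart.
Jitter projectionJitter(unsigned frameIndex, int width, int height)
{
    assert(width > 0 && height > 0);
    const float pixels = (frameIndex & 1u) ? 0.25f : -0.25f;
    // NDC spans 2 units across the viewport, so one pixel is 2 / size.
    return { pixels * 2.0f / static_cast<float>(width),
             pixels * 2.0f / static_cast<float>(height) };
}
```

The offset is usually applied to the translation terms of the projection matrix so that every vertex is shifted by the same sub-pixel amount.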
Temporal AA jitter pattern
Temporal anti-aliased image

In this post, we broke down the render passes in "Seal Guardian", which are composed of mainly 4 parts: shadow map, opaque geometry, transparent geometry and post process passes. By using fewer render passes, we can achieve a constant 60FPS in most cases (if the target frame rate is not met, we may skip some render passes such as temporal AA and shadow).

Lastly, "Seal Guardian" has already been released on Steam / Mac App Store / iOS App Store. If you want to support us to develop games with custom tech, then buying a copy of the game on any platform will help. Thank you very much.

[1] The Art and Technology behind Crysis 3 http://www.crytek.com/download/fmx2013_c3_art_tech_donzallaz_sousa.pdf