DXR AO

Introduction
(Edit 3/5/2020: An updated version of the demo can be downloaded here, which supports high DPI monitors and includes some bug fixes.)
It has been 2 months since my last post. For the past few months, the situation here in Hong Kong has been very bad. Our basic human rights are deteriorating. Absurd things are happening, such as suspected cooperation between the police and triads, as well as police brutality (including shooting directly at journalists). I really don't know what can be done... Maybe you could spare a few minutes to sign some of these petitions? Although such petitions may not be very useful, at least after signing some of them, the US Congress is now discussing the Hong Kong Human Rights and Democracy Act. I would sincerely appreciate your help. Thank you very much!

Back to today's topic: after setting up my D3D12 rendering framework, I started to learn DirectX Raytracing (DXR). I decided to write an ambient occlusion demo first because it is easier than writing a full path tracer, since I do not need to handle material information or lighting data. The demo can be downloaded from here (it requires a DXR compatible graphics card and driver, with Windows 10 version 1809 or newer).



Rendering pipeline
The demo first renders a G-buffer with normal and depth data. A velocity buffer is then generated from the current and previous frame camera transforms and stored in RG16Snorm format. Rays are then traced from world positions reconstructed from the depth buffer, with a cosine-weighted distribution. To avoid ray-geometry self-intersection, the ray origin is shifted slightly towards the camera. After that, temporal and spatial filters are applied to smooth out the noisy AO image, and an optional bilateral blur pass can be applied for a final clean-up.



Temporal Filter
With the noisy image generated by the ray tracing pass, we can reuse ray-traced data from previous frames to smooth out the image. In the demo, the velocity buffer is used to find the pixel location in the previous frame (with an additional depth check between the current frame depth value and the re-projected previous frame depth value). We are calculating ambient occlusion using Monte Carlo integration:
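(The equation is an image in the original post; written out in a standard form, with V the visibility term and ω_i cosine-weighted sample directions, the estimator is:)

\[ AO(x) \;=\; \frac{1}{\pi}\int_{\Omega} V(x,\omega)\,\cos\theta \,\mathrm{d}\omega \;\approx\; \frac{1}{N}\sum_{i=1}^{N} V(x,\omega_i), \qquad \omega_i \sim p(\omega) = \frac{\cos\theta}{\pi} \]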


We can split the Monte Carlo integration across multiple frames and store the AO result in an RG16Unorm texture, where the red channel stores the accumulated AO result and the green channel stores the total sample count N (the sample count is clamped to 255 to avoid overflow). So after a new frame is rendered, we can accumulate the AO Monte Carlo integration with the following equation:
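(Again the equation is an image in the original post; a standard running-average form, with N accumulated samples and n new samples this frame, would be:)

\[ AO_{accum}' \;=\; \frac{AO_{accum}\,N \;+\; AO_{frame}\,n}{N + n}, \qquad N' \;=\; \min(N + n,\; 255) \]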



We also reduce the sample count based on the depth difference between the current and previous frame depth buffer values (i.e. when the camera zooms in/out) to "fade out" the accumulated history faster and reduce ghosting artifacts.

AO image traced at 1 ray per pixel
AO image with accumulated samples over multiple frames

But this re-projection temporal filter has a shortcoming: re-projection fails very often at geometry edges (especially when done at half resolution). So in the demo, when re-projection fails, we shift by 1 pixel and perform the re-projection again to accumulate more samples.

Many edge pixels failed the re-projection test
With 1 pixel shifted, many edge pixels can be re-projected

As this result is biased, I also reduce the sample count by a factor of 0.75 to make the correct ray-traced result "blend in" faster.

Spatial Filter
To increase the sample count for the Monte Carlo integration, we can also reuse the ray-traced data of neighboring pixels. We search a 5x5 grid and reuse a neighbor's data if it lies on the same surface, determined by comparing delta depth values (i.e. ddx and ddy reconstructed from the depth buffer). As the delta depth values are reconstructed from the depth buffer, some artifacts may be seen at triangle edges.

Noisy AO image with a spatial filter applied
Artifacts at the triangle edges caused by the reconstructed delta depth
To save some performance, besides using half resolution rendering, we can also choose to interleave the ray casts, tracing only 1 out of every 4 pixels and ray casting the remaining pixels over the next few frames.

Rays are traced only at the red pixels
to save performance
For those pixels without any ray-traced data during interleaved rendering, we use the spatial filter to fill in the missing data. The same-surface depth check in the spatial filter can be bypassed when the sample count (stored in the green channel during the temporal filter) is low, because it is better to have some "wrong" neighbor data than no data at all for the pixel. This also helps to remove the edge artifacts shown before.

Rays are traced in an interleaved pattern,
leaving many 'holes' in the image
The spatial filter fills in those 'holes'
during interleaved rendering

Also, when ray casts are interleaved between pixels, we need to pay attention to the temporal filter too. There is a chance that we re-project to a previous frame pixel that has no sample data. In this case, we snap the re-projected UV to the pixel that did cast an interleaved ray in the previous frame.

Bilateral Blur
To clean up the remaining noise from the temporal and spatial filters, a bilateral blur is applied. We can get a wider blur by using the edge-aware À-Trous algorithm. The blur radius is adjusted according to the sample count (stored in the green channel by the temporal filter), so when many ray samples have already been cast, we can reduce the blur radius to get a sharper image.

Applying an additional bilateral blur to smooth out remaining noise

Random Ray Direction
When choosing the random ray directions, we want the chosen directions to have a more significant effect. Since we have a spatial filter that reuses neighboring pixels' data, we can try to cast rays such that the angles between the ray directions of neighboring pixels are as large as possible, covering as much of the hemisphere as possible.



It looks like we can use some kind of blue noise texture so that the ray directions are well distributed. Let's take a look at how the cosine-weighted random ray direction is generated:
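(The derivation is shown as images in the original post; the standard cosine-weighted mapping from two uniform random numbers ξ1, ξ2 is:)

\[ \theta \;=\; \cos^{-1}\!\sqrt{\xi_1}, \qquad \phi \;=\; 2\pi\,\xi_2, \qquad \omega \;=\; (\sin\theta\cos\phi,\; \sin\theta\sin\phi,\; \cos\theta) \]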



From the above equations, the angle ϕ directly corresponds to the random ray direction on the tangent plane, and it has a linear relationship with the random variable ξ2. Since we generate random numbers using a Wang hash, which produces white noise, maybe we can stratify the random range and use blue noise to pick a desired stratum, turning the distribution into something closer to a blue noise pattern. For example, given a random number in [0, 1), we can stratify it into 4 ranges: [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1). Then, using the screen space pixel coordinates, we sample a tileable blue noise texture, and according to the blue noise value, we scale the white noise random number into 1 of the 4 stratified ranges. Below is some sample code showing how the stratification is done:
int BLUE_NOISE_TEX_SIZE = 64;
int STRATIFIED_SIZE = 16;
// tileable blue noise texture, sampled with the screen space pixel coordinates
float4 noise = blueNoiseTex[pxPos % BLUE_NOISE_TEX_SIZE];
// quantize the blue noise value into one of the STRATIFIED_SIZE strata
uint2 noise_quantized = noise.xy * (255.0 * STRATIFIED_SIZE / 256.0);
// random white noise in range [0, 1)
float2 r = wang_hash(pxPos);
// rescale the white noise into the stratum selected by the blue noise
r = mad(r, 1.0 / STRATIFIED_SIZE, noise_quantized * (1.0 / STRATIFIED_SIZE));
With the blue-noise-adjusted ray directions, the ray-traced AO image looks visually less noisy:

Rays are traced using white noise
Rays are traced using blue noise
Blurred white noise AO image
Blurred blue noise AO image

Ray Binning
In the demo, ray binning is also implemented, but the performance improvement is not significant. Ray binning only shows a large performance gain when the ray tracing distance is large (e.g. > 10m) and both half resolution and interleaved rendering are turned off. I have only run the demo on my GTX 1060; maybe the situation will be different on RTX graphics cards (this is something I need to investigate in the future). Also, the demo output may differ slightly when toggling ray binning on/off due to the precision difference of using the RGBA16Float format to store ray directions (the difference vanishes after accumulating more samples over multiple frames with the temporal filter).

Conclusion
In this post, I have described how DXR is used to compute ray-traced AO in real time using a combination of temporal and spatial filters. Those filters are important for increasing the total sample count of the Monte Carlo integration and getting a noise-free, stable image. The demo can be downloaded from here. There is still plenty of room for improvement, such as a better filter: currently, when the AO distance is large and both half resolution and interleaved rendering are turned on (i.e. 1 ray per 16 pixels), the image is too noisy and not temporally stable during camera movement. Maybe I will improve this when writing a path tracer in the future.

References
[1] DirectX Raytracing (DXR) Functional Spec https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
[2] Edge-Avoiding À-Trous Wavelet Transform for Fast Global Illumination Filtering https://jo.dreggn.org/home/2010_atrous.pdf
[3] Free blue noise textures http://momentsingraphics.de/BlueNoise.html
[4] Quick And Easy GPU Random Numbers In D3D11 http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/
[5] Leveraging Real-Time Ray Tracing to build a Hybrid Game Engine http://advances.realtimerendering.com/s2019/Benyoub-DXR%20Ray%20tracing-%20SIGGRAPH2019-final.pdf
[6] "It Just Works": Ray-Traced Reflections in 'Battlefield V' https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91023-it-just-works-ray-traced-reflections-in-battlefield-v.pdf

Reflection and Serialization

Introduction
Reflection and serialization is a convenient way to save/load data. After reading "The Future of Scene Description on 'God of War'", I decided to try writing the "Compile-time Type Information" described in the presentation (but a much simpler version with fewer features). All I need is something to save/load C style structs (something like the D3D DESC structures, e.g. D3D12_SHADER_RESOURCE_VIEW_DESC) in my toy engine.

Reflection
A reflection system is needed to describe how structs are defined before writing a serialization system. This site has a lot of information about this topic. I use a similar approach, describing the C struct data with some macros. We define the following 2 data types to describe all possible structs that need to be reflected/serialized in my toy engine (with some variables omitted for easier understanding).
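A minimal sketch of what these two types could look like (the field names here are my guesses, and the real engine stores more data, e.g. a hash of the defining file):

#include <cstddef>

struct TypeInfoMember
{
    const char* name;    // member name, e.g. "x"
    size_t      size;    // sizeof() of the member
    size_t      offset;  // offsetof() from the start of the struct
};

struct TypeInfo
{
    const char*           name;        // struct name, e.g. "vec3"
    size_t                size;        // sizeof() of the struct
    size_t                alignment;   // alignof() of the struct
    const TypeInfoMember* members;     // description of each member
    size_t                memberCount;
};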



As you can guess from their names, TypeInfo is used to describe a C struct that needs to be reflected/serialized, and TypeInfoMember is responsible for describing the member variables inside the struct. We can use some macro tricks to reflect a struct (more can be found in the references):

struct reflection example
The above example reflects 3 variables inside struct vec3: x, y, z. The trick of those macros is to use sizeof(), alignof(), offsetof() and the using keyword. A sample implementation can be found below:
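The original implementation is shown as an image; a simplified sketch, assuming the TypeInfo/TypeInfoMember layout above (the real macros likely register the data into a global table instead), could look like this:

// assumes struct vec3 { float x, y, z; }; is already defined
#define REFLECT_BEGIN(Type)                                               \
    static const TypeInfo& getTypeInfo_##Type()                           \
    {                                                                     \
        using ReflectType = Type; /* lets REFLECT_MEMBER omit the type */ \
        static const TypeInfoMember members[] = {

#define REFLECT_MEMBER(Member)                                            \
            { #Member, sizeof(ReflectType::Member),                       \
              offsetof(ReflectType, Member) },

#define REFLECT_END(Type)                                                 \
        };                                                                \
        static const TypeInfo info = { #Type, sizeof(Type), alignof(Type),\
            members, sizeof(members) / sizeof(members[0]) };              \
        return info;                                                      \
    }

// usage, reflecting the 3 members of vec3:
REFLECT_BEGIN(vec3)
    REFLECT_MEMBER(x)
    REFLECT_MEMBER(y)
    REFLECT_MEMBER(z)
REFLECT_END(vec3)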



This approach has one disadvantage: we cannot use bit fields to specify how many bits are used by a variable (and bit field order seems to be compiler dependent anyway), so I just don't use bit fields in structs that need to be reflected.

Another disadvantage is that reflecting each variable manually is error-prone. So I have written a C struct header parser (using Flex & Bison) to generate the reflection source code. For C struct files that need auto-generated reflection data, instead of naming the source file with the extension .h, we name it with another file extension (e.g. .hds) and use a Visual Studio custom MSBuild file to execute my header parser. To make Visual Studio syntax-highlight this custom file type, we need to associate the file extension with C/C++ syntax by navigating to
"Tools" -> "Options" -> "Text Editor" -> "File Extension"
and add the appropriate association:


But one thing I cannot figure out is auto-complete when typing "#include" for the custom file extension; it looks like Visual Studio only filters for a couple of extensions (e.g. .h, .inl, ...) and cannot recognize my new file type... If someone knows how to do it, please leave a comment below. Thank you.
MSVC auto-complete filters for .h files only and cannot discover the new .hds type

Serialization
With the reflection data available, we know how large a struct is, how many member variables it has and their byte offsets from the start of the struct, so we can serialize our C struct data. We define the serialization format as a data header followed by a number of data chunks, as in the following figure:

Memory layout of a serialized struct

Data Header
The data header contains all the TypeInfo used by the serialized struct, as well as the architecture information (i.e. x86 or x64). During de-serialization, we can compare the runtime TypeInfo against the serialized TypeInfo to check whether the struct has any layout/type change (to speed up the comparison, we generate a hash value for every TypeInfo from the content of the file that defines the struct). If a layout/type change is detected, we de-serialize the struct variables one by one (and may perform data conversion if necessary, e.g. int to float); otherwise, we de-serialize the data in chunks.

Data Chunk
The values of a C struct are stored in data chunks. There are 6 types of data chunks: RawBytes, size_t, String, Struct, PointerSimple, PointerComplex. There are 2 reasons to divide the chunks into different types: first, we want the serialized data to be usable across architectures (e.g. serialized on x86, de-serialized on x64), where some data types have different sizes depending on the architecture (e.g. size_t, pointers); second, we want to support serializing pointers (with some restrictions). Below is a simple C struct that illustrates how the data is divided into chunks:

This Sample struct gets serialized into 3 data chunks
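(The Sample struct itself is an image in the original post. As a rough sketch of how the six chunk types could be tagged in the serialized stream, assuming a simple chunk header that is not shown in the post:)

#include <cstdint>

enum class ChunkType : uint8_t
{
    RawBytes,        // architecture independent values, copied with memcpy()
    SizeT,           // a single size_t, stored as a 64-bit integer
    String,          // null-terminated char* content
    Struct,          // nested struct containing architecture dependent members
    PointerSimple,   // pointed-to data is architecture independent
    PointerComplex,  // pointed-to data is architecture dependent
};

struct ChunkHeader
{
    ChunkType type;      // which of the 6 chunk types follows
    uint64_t  byteSize;  // payload size following this header
};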

RawBytes chunk
A RawBytes chunk contains a group of values whose sizes are architecture independent. Referring to the above Sample struct, the variables val_int and val_float are grouped into a single RawBytes chunk so that at run time those values can be de-serialized by a single call to memcpy().

size_t chunk
A size_t chunk contains a single size_t value, which gets serialized as a 64-bit integer to avoid data loss (loading a value that is too large on the x86 architecture will cause a warning). Usually this type will not be used; I just added it in case I need to serialize this type for a third party library.

String chunk
A String chunk is used for storing the string value of a char*; the serializer determines the length of the string (looking for '\0') and serializes it appropriately.

Struct chunk
A Struct chunk is used when we serialize a struct that contains another struct which has some architecture dependent variables. With this chunk type, we can serialize/de-serialize recursively.
The ComplexSample struct contains a Complex struct that has some architecture dependent values,
which cannot be collapsed into a RawBytes chunk, so it gets serialized as a Struct chunk instead.

PointerSimple chunk
A PointerSimple chunk stores a pointer variable where the size of the data referenced by the pointer does not depend on the architecture, so it can be de-serialized by a single memcpy() similar to the RawBytes chunk. To determine the length of a pointer (sometimes a pointer is used like an array), my C struct header parser recognizes some special macros which define the length of the pointer (these macros expand to nothing when parsed by the normal Visual Studio C/C++ compiler). Usually the length of the pointer depends on another variable within the same struct, so with the special macro we can define the length of the pointer like below:

The DESC_ARRAY_SIZE() macro tells the serializer that
the size depends on the variable num within the same struct
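The figure above is an image in the original post; a hypothetical reconstruction of such a struct (only the num/data names come from the text, the rest is made up, including where the macro is placed) could look like:

// DESC_ARRAY_SIZE() expands to nothing for the normal compiler and is only
// recognized by the header parser / serializer.
#define DESC_ARRAY_SIZE(count)

struct FloatArrayDesc
{
    int    num;                        // number of elements pointed to by data
    float* DESC_ARRAY_SIZE(num) data;  // serializer reads num to size this pointer
};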

When serializing the above struct, the serializer will look up the value of the variable num to determine the length of the pointer variable data, so that we know how many bytes need to be serialized for data.

But this macro is not enough to cover all my use cases. For example, when serializing D3D12_SUBRESOURCE_DATA for a 3D texture, the length of pData cannot simply be calculated from RowPitch and SlicePitch:

A sample struct for serializing a 3D texture, where the length of
D3D12_SUBRESOURCE_DATA::pData depends on the depth of the resource

The length can only be determined with access to the struct Texture3DDesc, which has the depth information. To tackle this, my serializer can register custom pointer length calculation callbacks (e.g. registered for the D3D12_SUBRESOURCE_DATA::pData variable inside the Texture3DDesc struct). The serializer keeps track of a stack of the struct types currently being serialized, so that the callback can be triggered appropriately.

Finally, if a pointer variable has neither a length macro nor a registered length calculation callback, we assume the pointer has a length of 1 (or 0 if it is nullptr).

PointerComplex chunk
A PointerComplex chunk stores a pointer variable whose referenced data is architecture dependent, similar to the Struct chunk type. It uses the same pointer length calculation method as the PointerSimple chunk type.

Serialize union
We can also serialize structs with union values that depend on another integer/enum variable, similar to D3D12_SHADER_RESOURCE_VIEW_DESC. We utilize the same macro approach used for pointer length calculation. For example:
A sample of serializing variables inside a union
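The example is an image in the original post; a hypothetical reconstruction (reusing the type/val_double/val_integer names mentioned below, everything else assumed) might look like:

// DESC_UNION() expands to nothing for the normal compiler; the header parser
// records that the member is only serialized when 'type' equals the given value.
#define DESC_UNION(member, value)

enum class ValType { Double, Integer };

struct UnionSample
{
    ValType type;
    union
    {
        double DESC_UNION(type, ValType::Double)  val_double;
        int    DESC_UNION(type, ValType::Integer) val_integer;
    };
};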
In the above example, the DESC_UNION() macro adds information about when the variable needs to be serialized. During serialization, we check the value of the variable type: if type == ValType::Double, we serialize val_double; else if type == ValType::Integer, we serialize val_integer.

Conclusion
This post has described how a simple reflection system for C structs is implemented: a macro based approach assisted by a code generator. Based on the reflection data, we can implement a serialization system to save/load C structs using compile-time type information. This system is simple and does not support complicated features like C++ class inheritance; it is mainly for serializing C style structs, which is enough for my current needs.

References
[1] https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/
[2] https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description
[3] https://blog.molecular-matters.com/2015/12/11/getting-the-type-of-a-template-argument-as-string-without-rtti/


Render Graph

Introduction
A render graph is a directed acyclic graph that can be used to specify the dependencies between render passes. It is a convenient way to manage rendering, especially when using a low level API such as D3D12. There are many great resources that talk about it, such as this and this. In this post I will talk about how the render graph is set up, render pass reordering, as well as resource barrier management.

Render Graph set up
For a simplified view of a render graph, we can treat each node inside the graph as a single render pass. For example, we can have a graph for a simple deferred renderer like this:
Render passes dependency within a render graph
With such a graph, we can derive the dependencies of the render passes, remove unused render passes, as well as reorder them. In my toy graphics engine, I use a simple scheme to reorder render passes. Taking the below render graph as an example, the render passes are added in the following order:
A render graph example
We can group it into several dependency levels like this:
split render passes into several dependency levels
Within each level, the passes are independent and can be reordered freely, so the render passes are enqueued into the command list in the following order:
Reordered render passes
Between dependency levels, we batch resource barriers to transition the resources to the correct states.
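As a sketch of how this grouping could be computed (my assumed implementation, not code from the engine): each pass's dependency level is one more than the deepest level among the passes it depends on.

#include <algorithm>
#include <vector>

struct RenderPassNode
{
    std::vector<int> dependencies; // indices of the passes this pass reads from
};

// Assumes passes were added in an order where a pass appears after the passes
// it depends on (the graph is acyclic), so one forward sweep is enough.
std::vector<int> computeDependencyLevels(const std::vector<RenderPassNode>& passes)
{
    std::vector<int> level(passes.size(), 0);
    for (size_t i = 0; i < passes.size(); ++i)
        for (int dep : passes[i].dependencies)
            level[i] = std::max(level[i], level[dep] + 1);
    return level; // passes sharing a level can be reordered freely
}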

Transient Resources
The above is just a simplified view of the graph. In fact, each render pass consists of a number of inputs and outputs. Every input/output is a graphics resource (e.g. a texture), and render passes are connected through such resources within the render graph.
Render graph connecting render passes and resources
As you can see in the above example, many transient resources are used (e.g. depth buffer, shadow map, etc.). We handle such transient resources using a texture pool: a texture is reused after it is no longer needed by previous passes (placed resources are not used, for simplicity). When building a render graph, we compute the lifetime of every transient resource (i.e. the dependency levels where the resource starts/ends), so we can free a transient resource when execution goes beyond its last dependency level and reuse it for a later render pass. To specify a render pass input/output in my engine, I only need to specify its size/format and don't need to worry about resource creation; the transient resource pool will create the textures as well as the required resource views (e.g. SRV/DSV/RTV).

Conclusion
In this post, I have described how render passes are reordered inside the render graph, when barriers are inserted, and how transient resources are handled. I have not implemented parallel recording of command lists or async compute yet. It really takes much more effort to use D3D12 than D3D11, but I think the current state of my hobby graphics engine is good enough to use. Looks like I can start learning DXR after spending lots of effort on basic D3D12 set-up code. =]

D3D12 Constant Buffer Management

Introduction
D3D12 does not have an explicit constant buffer API object (unlike D3D11). All we have in D3D12 is ID3D12Resource, which needs to be sub-divided into smaller regions with Constant Buffer Views, and it is our job to handle constant buffer lifetimes and avoid updating constant buffer values while the GPU is still using them. This post describes how I handle this.

Constant buffer pool
We allocate a large ID3D12Resource and treat it as an object pool by sub-dividing it into many small constant buffers (let's call it a constant buffer pool). Since constant buffers are required to be 256-byte aligned (I can only find this requirement in previous documentation, while the updated documentation only mentions it under "Uploading Texture Data Through Buffers", which is in a section about textures...), I defined 3 fixed-size pools: 256/512/1024 bytes. These 3 sizes are enough for my needs, as most constant buffers are small (in Seal Guardian, the largest constant buffer is 560 bytes, while large data like the skinning matrix palette is uploaded via a texture).
3 constant buffer pools with different sizes
In the last post, a non shader visible descriptor heap manager was used to handle non shader visible descriptors, but in fact that is only used for SRV/DSV/RTV descriptors; constant buffer views are managed with another scheme. As described above, when we create an ID3D12Resource for a constant buffer pool, we also create a non shader visible ID3D12DescriptorHeap large enough to hold descriptors pointing to all the constant buffers inside the pool.
ID3D12Resource and ID3D12DescriptorHeap are created in pairs
We also split the constant buffer pools based on their usage: static/dynamic. So there are 6 constant buffer pools in total inside my toy engine (static 256/512/1024 byte pools + dynamic 256/512/1024 byte pools).

Dynamic constant buffer
Constant buffers can be updated dynamically. Each constant buffer contains a CPU-side copy of its constant values. When it is bound before a draw call, those values are copied into the dynamic constant buffer pool (created in an upload heap); a piece of memory for the values is allocated from the pool in a ring buffer fashion. If the pool is full (i.e. the ring buffer wraps around too fast and all the constant buffers are still in use by the GPU), we create a larger pool, and the existing pool is deleted after all related GPU commands finish execution.
Resizing the dynamic constant buffer pool; the previous pool
will be deleted after executing the related GPU commands
To avoid copying the same constant buffer values to the pool when a constant buffer is bound multiple times, we keep 2 integer values for every dynamic constant buffer: a "last upload frame index" and a "value version". The last upload frame index is the frame index at which the CPU constant buffer values were last copied to the dynamic pool. The value version is an integer that is monotonically increased every time the constant buffer values get modified/updated. By checking these 2 integers, we can avoid duplicated copies of a constant buffer in the dynamic pool and reuse the previously copied values, as sketched below.
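A minimal sketch of that check (field and function names are assumptions):

#include <cstdint>
#include <vector>

struct DynamicConstantBuffer
{
    std::vector<uint8_t> cpuValues;                // CPU-side copy of the constants
    uint32_t             valueVersion      = 0;    // bumped on every value update
    uint32_t             lastUploadFrame   = ~0u;  // frame index of the last copy
    uint32_t             lastUploadVersion = ~0u;  // valueVersion at the last copy
    uint64_t             lastUploadOffset  = 0;    // location inside the ring buffer
};

// Copy into the dynamic pool only if this frame has no up-to-date copy yet;
// otherwise the previously copied values can be bound again.
bool needsUpload(const DynamicConstantBuffer& cb, uint32_t frameIndex)
{
    return cb.lastUploadFrame != frameIndex ||
           cb.lastUploadVersion != cb.valueVersion;
}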

Static constant buffer
Static constant buffers have a static descriptor handle as described in the last post. The static constant buffer pools are created in the default heap. A pool is managed in a free-list fashion, as opposed to the ring buffer of the dynamic pools. When the pools are full, we still create an extra pool for new constant buffer allocation requests, but unlike the dynamic pools, the previous pools are not deleted when a new pool gets created.
Creating more static constant buffer pools if the existing pools are full
To upload static constant buffer values to the GPU (since the static pools are created in the default heap), we use the dynamic constant buffer pool instead of creating another upload heap. Every frame, we gather all newly created static constant buffers; then, before we start rendering the frame, we copy all their CPU constant buffer values to the dynamic constant buffer pool and schedule ID3D12GraphicsCommandList::CopyBufferRegion() calls to copy those values from the upload heap to the default heap. By grouping all the static constant buffer uploads, we can reduce the number of D3D12_RESOURCE_BARRIERs needed to transition between the D3D12_RESOURCE_STATE_COPY_DEST and D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER states.

Conclusion
In this post, I have described how constant buffers are managed in my toy engine. It uses a number of pools of different sizes, managed in a ring buffer fashion for dynamic constant buffers and in a free-list fashion for static constant buffers. Static constant buffer uploads are grouped together to reduce barrier usage. However, I currently only split usage into static/dynamic; in the future, I would like to investigate the performance of adding another usage type for cases like a constant buffer that is updated every frame and used frequently in many draw calls (e.g. write once, read many within a frame), where I would like to place those resources in the default heap instead of the current dynamic upload heap.

Reference
[1] https://docs.microsoft.com/en-us/windows/desktop/direct3d12/large-buffers
[2] https://www.gamedev.net/forums/topic/679285-d3d12-how-to-correctly-update-constant-buffers-in-different-scenarios/





D3D12 Descriptor Heap Management

Introduction
Continuing from the last post, where we described how root signatures are managed to bind resources: the root signature is just one part of resource binding, we also need descriptors to bind resources. Descriptors are small blocks of memory describing an object (CBV/SRV/UAV/Sampler) to the GPU. They are stored in descriptor heaps, which may be shader visible or non shader visible. In this post, I will talk about how descriptors are managed for resource binding in my toy graphics engine.

Non shader visible heap
Let's start with non shader visible heap management. We can treat a descriptor as a pointer to a GPU resource (e.g. a texture). A descriptor heap is a piece of memory used for storing descriptors, and the size of a single descriptor can be queried with ID3D12Device::GetDescriptorHandleIncrementSize(). So we treat a descriptor heap as an object pool, where every descriptor within the same heap can be referenced by an index.
Non shader visible descriptor heap containing N descriptors
Since we don't know in advance how many descriptors are needed, and we may have many non shader visible heaps, a non shader visible heap manager is created for allocating descriptors from the descriptor heap(s). This manager contains at least 1 descriptor heap. When a descriptor allocation request is made to the manager, it first looks for a free descriptor in the existing descriptor heap(s); if none is found, a new descriptor heap is created to handle the request.
The descriptor heap manager handles descriptor allocation requests, creating descriptor heaps if necessary
Within the graphics engine, we use a "non shader visible descriptor handle" to reference a D3D12 descriptor; it stores the heap index and descriptor index with respect to a descriptor heap manager. Every texture created in the engine has a "non shader visible descriptor handle" for resource binding (more on this later).
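A sketch of such a handle and of how it could be resolved back to a D3D12 CPU descriptor handle (the struct and function names are assumptions; the D3D12 calls are the real API):

#include <d3d12.h>
#include <vector>

struct NonShaderVisibleHandle
{
    UINT heapIndex;       // which heap inside the manager
    UINT descriptorIndex; // slot within that heap
};

D3D12_CPU_DESCRIPTOR_HANDLE resolveHandle(
    const std::vector<ID3D12DescriptorHeap*>& heaps, // heaps owned by the manager
    NonShaderVisibleHandle handle,
    UINT descriptorSize) // from ID3D12Device::GetDescriptorHandleIncrementSize()
{
    D3D12_CPU_DESCRIPTOR_HANDLE cpu =
        heaps[handle.heapIndex]->GetCPUDescriptorHandleForHeapStart();
    cpu.ptr += SIZE_T(handle.descriptorIndex) * descriptorSize;
    return cpu;
}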

Shader visible heap
Next, we will talk about shader visible heap management. The shader visible heap is responsible for binding the resources that get used in shaders. It is recommended to use just 1 heap for all frames so that asynchronous compute and graphics workloads can run in parallel (on Nvidia hardware). So we just create 1 large shader visible heap at the start of the program and don't bother to resize/allocate a larger heap when the heap is full (we just assert in this case). This single large shader visible descriptor heap is divided into 2 regions: static/dynamic.
A single large shader visible descriptor heap, divided into 2 regions

Dynamic descriptor
Dynamic descriptors are used for transient resources whose descriptor tables cannot be reused often. During resource binding (e.g. of a texture), their non shader visible descriptors are copied to the shader visible heap via ID3D12Device::CopyDescriptors(), where the copy destination (i.e. the dynamic shader visible descriptors) is allocated in a ring buffer fashion (note that the copy operation has a restriction that the copy source must be in a non shader visible heap, which is why we allocate a "non shader visible descriptor handle" for every texture).

Static descriptor
Static descriptors are used for resources that can be grouped together into a descriptor table so that it can be reused over multiple frames. For example, the set of textures inside a material will not change very often, so those textures can be grouped into a descriptor table. My current approach is to manage the static region of the shader visible descriptor heap in a "stack" based fashion. Instead of a stack of individual descriptors, we have a stack of groups of descriptors; typically, 1 static descriptor group is created during level load.
Static descriptors are packed into a group during level load
Inside a group of static descriptors, the descriptors are sorted such that all constant buffer descriptors appear before texture descriptors. Null descriptors may also need to be added to respect the Hardware Tiers restrictions. To identify a static descriptor in the shader visible heap, we use the stack group index together with the descriptor index within the group.
Descriptors are ordered by type, with necessary padding
Each "static resource" (e.g. constant buffer/texture) has a "static descriptor handle" besides the "non shader visible descriptor handle". We can check whether a set of resources lies within the same descriptor table by comparing their stack group indices and checking that their descriptor indices are consecutive. With this information, we can create a resource binding API similar to D3D11 (e.g. ID3D11DeviceContext::PSSetShaderResources()): if all the resources in the API call are in the same descriptor table, we can use the static descriptors to bind the descriptor table directly; otherwise, we fall back to the dynamic descriptor approach described in the previous section to build a contiguous descriptor table. (I have also thought of creating an explicit "descriptor table" object instead of a D3D11-like binding API, say during material loading, grouping the material textures into a descriptor table so that resource binding can skip the consecutive descriptor index check described above. But currently I just stick with the simple solution first...)

As mentioned before, static descriptor groups are allocated in a "stack" based fashion. But my current implementation is not strictly "last in, first out": we can remove a group in between, leaving a "hole" in the static shader visible heap region, which results in fragmentation.

Fragmented static descriptor heap region
In theory, we can defragment this heap region by moving descriptor groups into the unused space (this works because we reference descriptors inside a heap by index instead of by address directly), and during defragmentation we can temporarily switch to dynamic descriptors to avoid overwriting the static heap region while GPU commands are still using it. But I have not implemented defragmentation yet, because I only have one simple level (i.e. only 1 static descriptor group) for now...

Conclusion
In this post, I have described how descriptor heaps are managed for resource binding. To sum up, the shader visible descriptor heap is divided into 2 regions: static/dynamic. The static region is managed in a "stack" based approach: during level loading, all the static CBV/SRV descriptors are stored within a static descriptor stack group, which forms a big contiguous descriptor table. This increases the chance of reusing the descriptor table. In addition to this optional static descriptor, every resource must have a non shader visible descriptor handle, which is used when a static descriptor table cannot be used during resource binding; it gets copied to the shader visible heap to form a new descriptor table. With this kind of heap management, we can create a resource binding API similar to D3D11 that calls the underlying D3D12 API using descriptors.

References
[1] https://docs.microsoft.com/en-us/windows/desktop/direct3d12/resource-binding
[2] https://www.gamedev.net/forums/topic/686440-d3d12-descriptor-heap-strategies/


D3D12 Root Signature Management

Introduction
Continuing from the last post about writing my new toy D3D12 graphics engine, we have compiled some shaders and extracted some reflection data from the shader source. The next problem is to bind resources (e.g. constant buffers / textures) to the shaders. D3D12 uses root signatures together with root parameters to achieve this task. In this post, I will describe how my toy engine creates root signatures automatically based on shader resource usage.

Left: new D3D12 graphics engine (with only basic diffuse material)
Right: previous D3D11 rendering (with PBR material, GI...)
Still a long way to go to catch up with the previous renderer... 

Resource binding model
In D3D12, shader resource binding relies on root parameter indices. But when iterating on shader code, we may modify some resource bindings (e.g. add a texture variable / remove a constant buffer); the root signature may then change, which changes the root parameter indices. We would need to update every function call like SetGraphicsRootDescriptorTable() with the new root parameter index, which is tedious and error-prone... The resource binding model in D3D11 (e.g. PSSetShaderResources() / PSSetConstantBuffers()) doesn't have such a problem, as the API defines a set of fixed slots to bind to. So I would prefer to work with a similar binding model in my toy engine.

So, I defined a set of slots for resource binding as follows (which is a bit different from D3D11):
Engine_PerDraw_CBV
Engine_PerView_CBV
Engine_PerFrame_CBV
Engine_PerDraw_SRV_VS_ONLY
Engine_PerDraw_SRV_PS_ONLY
Engine_PerDraw_SRV_ALL
Engine_PerView_SRV_VS_ONLY
Engine_PerView_SRV_PS_ONLY
Engine_PerView_SRV_ALL
Engine_PerFrame_SRV_VS_ONLY
Engine_PerFrame_SRV_PS_ONLY
Engine_PerFrame_SRV_ALL
Shader_PerDraw_CBV
Shader_PerDraw_SRV_VS_ONLY
Shader_PerDraw_SRV_PS_ONLY
Shader_PerDraw_SRV_ALL
Shader_PerDraw_UAV
Instead of having fixed slots per shader stage as in D3D11, my toy engine's fixed slots can be summarized into 3 categories:
Resource binding slot categories

Slot category "Resource Type"
As described by its name (CBV/SRV/UAV), this part of the slot name is used to bind the corresponding resource type: constant buffer view / shader resource view / unordered access view.
The SRV type is further sub-divided into VS_ONLY / PS_ONLY / ALL sub-categories, which refer to the shader visibility. According to the Nvidia Do's And Don'ts, limiting the shader visibility will improve performance.
For the CBV type, the shader visibility is deduced from the shader reflection data during root signature and PSO creation.

Slot category "Change frequency"
Resources are encouraged to be bound based on their update frequency, so this slot category is divided into 3 types: Per Frame / Per View / Per Draw.
The Per Frame / Per View types use a descriptor table as their root parameter type.
The Per Draw CBV type uses a root descriptor as its root parameter type.
The Per Draw SRV type still uses a descriptor table instead of root descriptors because, for example, it is common to have only 1 constant buffer for a mesh's material while binding multiple textures for the same material, so using a descriptor table helps keep the root signature small.

Slot category "Usage"
This category sub-divides the slots into different usage patterns: Engine/Shader.
Engine usage typically covers binding things like mesh transform constants, camera transforms, etc.
Shader usage is for shader specific data, e.g. material constants.
I just couldn't find an appropriate name for this category and simply used the names Engine/Shader. Maybe it would be better to call them Group 0/1/2/3... in case I have different usage patterns in the future, but I don't bother with it for now...

Shader Reflection
In the last post, I mentioned that shader reflection data is exported during shader compilation. This is important for root signature creation: from the reflection data, we know which constant buffer/texture slots get used. When creating a pipeline state object (PSO) from shaders, we can deduce all the resource slots used by the PSO (as well as the shader visibility for constant buffers) and then create an appropriate root signature, with each resource slot mapped to the corresponding root parameter index (let's call this mapping data the "root signature info").

To specify the resource slot in shader code, we make use of the register space introduced in Shader Model 5.1 to define which slot is used for a constant buffer/texture. For example:
#define ENGINE_PER_DRAW_SRV_ALL space5 // all shaders must have the same slot-space definition
Texture2D shadowMap : register(t0, ENGINE_PER_DRAW_SRV_ALL);
With the above information, on the CPU side we can bind a resource to a specific slot using the root parameter index stored inside the "root signature info", similar to the D3D11 API.
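A sketch of how such a slot based binding call could look on top of the stored mapping (the enum values are the slots listed above; everything else here is an assumed shape, not the engine's actual code):

#include <d3d12.h>

enum class BindSlot : UINT
{
    Engine_PerDraw_CBV, Engine_PerView_CBV, Engine_PerFrame_CBV,
    Engine_PerDraw_SRV_ALL, Shader_PerDraw_SRV_ALL, /* ...the rest... */ Count
};

struct RootSignatureInfo
{
    // Root parameter index assigned to each slot during PSO/root signature
    // creation, or -1 if the slot is not used by this PSO.
    int rootParameterIndex[UINT(BindSlot::Count)];
};

void setDescriptorTable(ID3D12GraphicsCommandList* cmdList,
                        const RootSignatureInfo& info,
                        BindSlot slot,
                        D3D12_GPU_DESCRIPTOR_HANDLE table)
{
    const int rootIndex = info.rootParameterIndex[UINT(slot)];
    if (rootIndex >= 0) // skip slots not present in this root signature
        cmdList->SetGraphicsRootDescriptorTable(UINT(rootIndex), table);
}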

Conclusion
In this post, we have described how root signatures can be automatically created and used behind a slot based API. First, a root signature is created (or re-used/shared) during the creation of a pipeline state object (PSO), based on its shader reflection data. We also create a "root signature info" to store the mapping between resource slots and root parameter indices, kept together with the root signature and PSO. Then we can use this "root signature info" to bind resources to the shaders.

As this is my first time writing a graphics engine with D3D12, I am not sure whether this resource binding model is the best. I have also thought of another naming scheme for the resource slots: instead of naming them by PerDraw / PerView type, would it be better to name them explicitly as RootDescriptor / DescriptorTable? Maybe I will change my mind after I gain more experience in the future...

MSBuild custom build tools notes

Introduction
Recently, I have been re-writing my graphics code to use D3D12 (instead of the D3D11 used in Seal Guardian), so I need a convenient way to compile shader code. While tidying up the MSBuild custom build step files used in Seal Guardian for my new toy graphics engine, I regretted that I did not write a blog post about custom MSBuild at that time; as I remember, finding such information was hard, and I needed to look at some of the CUDA custom build files to guess how it works. So this post is just my personal notes about custom MSBuild, and I don't guarantee all the information about MSBuild is 100% correct. I have uploaded an example project to compile shader files here. Interested readers may also check out this excellent post about MSBuild written by Nathan Reed.

Custom build steps set up
MSBuild needs a .targets file to describe how the compilers (e.g. dxc/fxc used for shader compilation) are invoked. In the uploaded example project, we have 3 main targets: DXC, JSON, BIN.

- DXC target: as its name suggests, invokes dxc.exe to compile HLSL files.
- JSON target: invokes shaderCompiler.exe, our internal tool written using Flex & Bison, which parses the shader source code and outputs some metadata, like texture/constant buffer usage for root signature management.
- BIN target: depends on the DXC and JSON targets; it invokes dataBuilder.exe, our internal tool for data serialization/deserialization into our binary format, combining the output from the DXC and JSON targets.

target dependency

Although MSBuild can set up the target dependencies, it looks like independent targets are not executed in parallel. In Seal Guardian, when compiling the surface shaders which generate the shader permutations for lighting, this resulted in a long compilation time. In the end, I had to create another exe that launches multiple threads to speed up the shader compilation. Maybe I was setting up MSBuild incorrectly; if anyone knows how to parallelize it, please let me know in the comments below. Thank you!

Incremental Builds
MSBuild uses .tlog files to track file modifications and avoid unnecessary compilation (they also affect which files get deleted when cleaning the project). There are 2 tlog files (read.1.tlog and write.1.tlog): one tracks whether the source files have been modified, and the other tracks whether the output files are up to date. We can simply use the WriteLinesToFile task to record such dependencies, e.g.

<WriteLinesToFile File="$(TLogLocation)$(ProjectName).write.1.tlog" Lines="$(TLog_writelines)" />

But doing only this will make the tlog file grow larger and larger after every compilation. So it is better to read the tlog file content into a PropertyGroup and check whether the file already contains the text we would like to write, using a "Condition" on the WriteLinesToFile task. For details, please take a look at the example project.

Also, as a side note, do not include the $(MSBuildProjectFile) property in the "Inputs" attribute of the "Target" element. I did this accidentally, and it caused the whole project to recompile all the shaders every time a shader file was added to / removed from the project. This is not necessary, as most of the shader files are independent.

Output files
Like every Visual Studio project, our example project has Debug and Release configurations. After executing the BIN target described above, we also use a Copy task to copy the compiled shaders from the Debug/Release $(OutDir) directory to our content directory. We can also use the property function MakeRelative() to maintain the directory hierarchy in the output directory. This is another reason why I use a Copy task instead of pointing $(OutDir) to the content directory: I cannot get a nested property function working inside the .props file (or maybe I did something wrong? I don't know...).

Besides output files, another note is about the output log. If we want to write something to the Visual Studio output window from a custom exe (e.g. dataBuilder.exe in the example project), the text must be in a specific format like the following (I cannot find documentation for the exact format; this is just guessed from similar messages emitted by Visual Studio...):

1>C:/shaderFileName.hlsl(123): error : error message

otherwise, those messages will not get displayed in the output window.


Example project
An example project is uploaded here. It compiles vertex/pixel shaders with dxc.exe and outputs JSON metadata to $(IntDir), then combines that data and writes it to $(OutDir). Finally, those files are copied to the asset directory, preserving their path relative to the source directory. Please note that the shaderCompiler.exe used for outputting metadata is an internal tool, which has some custom restrictions on the HLSL grammar for my convenience in creating root signatures. It is used just as an example to illustrate how to set up a custom MSBuild tool; feel free to replace/modify those files to suit your own needs. Thank you.

Reference
[1] https://docs.microsoft.com/en-us/cpp/build/understanding-custom-build-steps-and-build-events?view=vs-2019
[2] http://www.reedbeta.com/blog/custom-toolchain-with-msbuild/