Reflection and Serialization

Introduction
Reflection and serialization is a convenient way to save/load data. After reading "The Future of Scene Description on 'God of War'", I decided to try writing something like the "Compile-time Type Information" described in the presentation (but a much simpler one with fewer features). All I need is something to save/load C-style structs (something like the D3D DESC structures, e.g. D3D12_SHADER_RESOURCE_VIEW_DESC) in my toy engine.

Reflection
A reflection system is needed to describe how structs are defined before writing a serialization system. This site has plenty of information about this topic. I use a similar approach to describe the C struct data with some macros. We define the following 2 data types to describe all the structs that need to be reflected/serialized in my toy engine (with some variables omitted for easier understanding).
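Below is a minimal sketch of what these two types could look like; the field names and layout are assumptions for illustration, since only the roles of the types are described here.

#include <cstddef>

// Describes one reflected member variable inside a struct.
struct TypeInfoMember
{
    const char* name;    // member variable name
    size_t      offset;  // byte offset from the start of the owning struct
    size_t      size;    // sizeof() of the member
};

// Describes one reflected struct.
struct TypeInfo
{
    const char*           name;        // struct name
    size_t                size;        // sizeof() of the whole struct
    size_t                alignment;   // alignof() of the whole struct
    const TypeInfoMember* members;     // array of reflected members
    size_t                memberCount; // number of entries in members
};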



As you can guess from their names, TypeInfo is used to describe the C structs that need to be reflected/serialized, and TypeInfoMember is responsible for describing the member variables inside the struct. We can use some macro tricks to reflect a struct (more can be found in the references):

struct reflection example
The above example reflects 3 variables inside struct vec3: x, y, z. The trick of those macros is to use sizeof(), alignof(), offsetof() and the using keyword. A sample implementation can be found below:
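Roughly, the macros can be implemented along these lines (a sketch that builds on the TypeInfo/TypeInfoMember sketch above; the engine's actual macros may differ in detail). The using alias is what lets offsetof() and sizeof() refer back to the struct being reflected.

#include <cstddef>

#define REFLECT_BEGIN(Type)                                       \
    const TypeInfo& GetTypeInfo_##Type()                          \
    {                                                             \
        using SelfType = Type;                                    \
        static const TypeInfoMember members[] = {

#define REFLECT_MEMBER(Member)                                    \
            { #Member, offsetof(SelfType, Member),                \
              sizeof(SelfType::Member) },

#define REFLECT_END(Type)                                         \
        };                                                        \
        static const TypeInfo info = { #Type, sizeof(Type),       \
            alignof(Type), members,                               \
            sizeof(members) / sizeof(members[0]) };               \
        return info;                                              \
    }

// Reflect the 3 members of vec3 from the example above.
struct vec3 { float x, y, z; };

REFLECT_BEGIN(vec3)
    REFLECT_MEMBER(x)
    REFLECT_MEMBER(y)
    REFLECT_MEMBER(z)
REFLECT_END(vec3)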



This approach has one disadvantage: we cannot use bit fields to specify how many bits a variable uses, and bit field order seems to be compiler dependent, so I just don't use them in the structs that need to be reflected.

It also has another disadvantage: reflecting each variable manually is error-prone. So I have written a C struct header parser (using Flex & Bison) to generate the reflection source code. For those C struct files that need auto-generated reflection data, instead of naming the source file with the extension .h, we name it with another file extension (e.g. .hds) and use a custom Visual Studio MSBuild file to execute my header parser. To make Visual Studio syntax-highlight this custom file type, we need to associate this file extension with the C/C++ syntax by navigating to
"Tools" -> "Options" -> "Text Editor" -> "File Extension"
and add the appropriate association:


But one thing I cannot figure out is the auto-complete when typing "#include" for a custom file extension; it looks like Visual Studio only filters for a couple of extensions (e.g. .h, .inl, ...) and cannot recognize my new file type... If someone knows how to do it, please leave a comment below. Thank you.
MSVC auto-complete filters for .h files only and cannot discover the new .hds type

Serialization
With the reflection data available, we know how large a struct is, how many member variables it has and their byte offsets from the start of the struct, so we can serialize our C struct data. We define the serialization format with a data header and a number of data chunks, as in the following figure:

Memory layout of a serialized struct

Data Header
The data header contains all the TypeInfo used in the struct that gets serialized, as well as the architecture information (i.e. x86 or x64). During de-serialization, we can compare the runtime TypeInfo against the serialized TypeInfo to check whether the struct has any layout/type change (to speed up the comparison, we generate a hash value for every TypeInfo from the content of the file that defines the struct). If a layout/type change is detected, we de-serialize the struct variables one by one (and may perform data conversion if necessary, e.g. int to float); otherwise, we de-serialize the data in chunks.

Data Chunk
The values of a C struct are stored in data chunks. There are 6 types of data chunks: RawBytes, size_t, String, Struct, PointerSimple, PointerComplex. There are 2 reasons to divide the chunks into different types: First, we want the serialized data to be usable across different architectures (e.g. serialized on x86, de-serialized on x64), where some data types have different sizes depending on the architecture (e.g. size_t, pointers). Second, we want to support serializing pointers (with some restrictions). Below is a simple C struct that illustrates how the data is divided into chunks:
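A struct along the following lines would be split into the three chunks described below; only val_int and val_float are named in the text, so the remaining members are assumptions used for illustration.

#include <cstddef>

struct Sample
{
    int    val_int;    // grouped with val_float into one RawBytes chunk
    float  val_float;  // (architecture-independent sizes)
    size_t val_size;   // size_t chunk (4 or 8 bytes depending on architecture)
    char*  val_name;   // String chunk (length found by searching for '\0')
};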

This Sample struct gets serialized into 3 data chunks

RawBytes chunk
A RawBytes chunk contains a group of values whose sizes are architecture independent. Referring to the above Sample struct, the variables val_int and val_float are grouped into a single RawBytes chunk so that at runtime those values can be de-serialized with a single call to memcpy().

size_t chunk
A size_t chunk contains a single size_t value, which gets serialized as a 64-bit integer value to avoid data loss. Loading a value that is too large on the x86 architecture will cause a warning. Usually this type is not used; I just added it in case I need to serialize size_t for a third-party library.

String chunk
A String chunk is used for storing the string value of a char*; the serializer can determine the length of the string (by looking for the terminating '\0') and serialize it appropriately.

Struct chunk
A Struct chunk is used when we serialize a struct that contains another struct which has some architecture-dependent variables. With this chunk type, we can serialize/de-serialize recursively.
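The post does not show the exact definitions, but the idea is roughly the following (the member names are assumptions):

#include <cstddef>

struct Complex
{
    size_t count;   // architecture dependent, so a plain RawBytes copy won't work
    float  weight;
};

struct ComplexSample
{
    int     val_int;  // RawBytes chunk
    Complex complex;  // Struct chunk, serialized recursively
};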
The ComplexSample struct contains a Complex struct that has some architecture-dependent values,
which cannot be collapsed into a RawBytes chunk, so it gets serialized as a Struct chunk instead.

PointerSimple chunk
A PointerSimple chunk stores a pointer variable where the size of the data referenced by the pointer does not depend on the architecture and can be de-serialized with a single memcpy(), similar to the RawBytes chunk. To determine the length of a pointer (sometimes a pointer can be used like an array), my C struct header parser recognizes some special macros which define the length of the pointer (these macros expand to nothing when parsed by the normal Visual Studio C/C++ compiler). Usually the length of the pointer depends on another variable within the same struct, so with the special macro we can define the length of the pointer like below:
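Such a length macro might be used roughly like this (the empty macro body and the struct name are assumptions; num and data are the variable names used in the post):

// Expands to nothing for the normal C/C++ compiler; only the header parser
// interprets it, linking the pointer length to another member.
#define DESC_ARRAY_SIZE(count)

struct ArraySample
{
    int    num;                        // number of elements pointed to by data
    float* data DESC_ARRAY_SIZE(num);  // serialized length of data == num elements
};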

The DESC_ARRAY_SIZE() macro tells the serializer that
the size depends on the variable num within the same struct

When serializing the above struct, the serializer will look up the value of the variable num to determine the length of the pointer variable data, so that we know how many bytes need to be serialized for data.

But using this macro is not enough to cover all my use cases. For example, when serializing D3D12_SUBRESOURCE_DATA for a 3D texture, the length of the pData variable cannot simply be calculated from RowPitch and SlicePitch:
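The struct in question looks roughly like this (member names other than the D3D12 type are assumptions; only depth and the D3D12_SUBRESOURCE_DATA member matter here):

#include <d3d12.h>

struct Texture3DDesc
{
    unsigned int           width;
    unsigned int           height;
    unsigned int           depth;           // number of slices in the 3D texture
    DXGI_FORMAT            format;
    D3D12_SUBRESOURCE_DATA subresourceData; // pData length = SlicePitch * depth
};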

A sample struct to serialize a 3D texture, in which the length of
D3D12_SUBRESOURCE_DATA::pData depends on the depth of the resource

The length can only be determined with access to the struct Texture3DDesc, which has the depth information. To tackle this, my serializer can register a custom pointer length calculation callback (e.g. registered for the D3D12_SUBRESOURCE_DATA::pData variable inside the Texture3DDesc struct). The serializer keeps track of a stack of the struct types currently being serialized, so that the callback can be triggered appropriately.
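A hypothetical shape for such a registration API is sketched below, building on the Texture3DDesc sketch above; the function name, its signature and the member-path key are all assumptions, not the engine's real interface.

#include <cstddef>
#include <functional>

// Callback: given the struct instance currently on top of the serializer's
// struct stack, return the number of bytes referenced by the pointer member.
using PointerLengthCallback = std::function<size_t(const void* owningStruct)>;

// Hypothetical registration, keyed by struct type name and member path.
void RegisterPointerLengthCallback(const char* structName,
                                   const char* memberPath,
                                   PointerLengthCallback callback);

// Example: length of D3D12_SUBRESOURCE_DATA::pData inside Texture3DDesc.
void RegisterTexture3DPointerLength()
{
    RegisterPointerLengthCallback("Texture3DDesc", "subresourceData.pData",
        [](const void* owner) -> size_t
        {
            const Texture3DDesc* desc = static_cast<const Texture3DDesc*>(owner);
            return static_cast<size_t>(desc->subresourceData.SlicePitch) * desc->depth;
        });
}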

Finally, if a pointer variable has neither a length macro nor a registered length calculation callback, we assume the pointer has a length of 1 (or 0 if nullptr).

PointerComplex chunk
A PointerComplex chunk also stores a pointer variable, but the data being referenced is architecture dependent, similar to the Struct chunk type. It uses the same pointer length calculation method as the PointerSimple chunk type.

Serialize union
We can also serialize structs with union values whose active member depends on another integer/enum variable, similar to D3D12_SHADER_RESOURCE_VIEW_DESC. We utilize the same macro approach used for pointer length calculation. For example:
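The sample could look roughly like this (the empty macro body and the struct/enum layout are assumptions; the names type, ValType, val_double and val_integer come from the description below):

// Expands to nothing for the normal compiler; the header parser reads it to
// record the condition under which a union member gets serialized.
#define DESC_UNION(condition)

enum class ValType { Double, Integer };

struct UnionSample
{
    ValType type;
    union
    {
        double val_double  DESC_UNION(type == ValType::Double);
        int    val_integer DESC_UNION(type == ValType::Integer);
    };
};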
A sample to serialize variables inside a union
In the above example, the DESC_UNION() macro adds information about when the variable needs to be serialized. During serialization, we check the value of the variable type: if type == ValType::Double, we serialize val_double; else if type == ValType::Integer, we serialize val_integer.

Conclusion
This post has described how a simple reflection system for C structs is implemented: a macro-based approach assisted by a code generator. Based on the reflection data, we can implement a serialization system to save/load C structs using compile-time type information. This system is simple and does not support complicated features like C++ class inheritance. It is mainly for serializing C-style structs, which is enough for my current needs.

References
[1] https://preshing.com/20180116/a-primitive-reflection-system-in-cpp-part-1/
[2] https://www.gdcvault.com/play/1026345/The-Future-of-Scene-Description
[3] https://blog.molecular-matters.com/2015/12/11/getting-the-type-of-a-template-argument-as-string-without-rtti/


Render Graph

Introduction
A render graph is a directed acyclic graph that can be used to specify the dependencies between render passes. It is a convenient way to manage rendering, especially when using a low-level API such as D3D12. There are many great resources talking about it, such as this and this. In this post I will talk about how the render graph is set up, render pass reordering, as well as resource barrier management.

Render Graph set up
To have a simplified view of a render graph, we can treat each node inside the graph as a single render pass. For example, we can have a graph for a simple deferred renderer like this:
Render pass dependencies within a render graph
By having such a graph, we can derive the dependencies of the render passes, remove unused render passes, as well as reorder them. In my toy graphics engine, I use a simple scheme to reorder render passes. Taking the below render graph as an example, the render passes are added in the following order:
A render graph example
We can group them into several dependency levels like this:
split render passes into several dependency levels
Within each level, the passes are independent and can be reordered freely, so the render passes are enqueued into the command list in the following order:
Reordered render passes
Between dependency levels, we batch resource barriers to transition the resources to the correct states.
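A sketch of how the dependency levels could be computed and used for reordering (illustrative code, not the engine's actual implementation): each pass's level is one more than the deepest level among the passes it depends on, and passes are then stable-sorted by level.

#include <algorithm>
#include <cstddef>
#include <vector>

struct RenderPassNode
{
    std::vector<size_t> dependencies; // indices of passes this pass reads from
    int                 level = 0;    // dependency level, filled in below
};

// Assumes passes were added so that dependencies appear before dependents
// (as in the example above), so a single forward sweep is enough.
void ComputeDependencyLevels(std::vector<RenderPassNode>& passes)
{
    for (RenderPassNode& pass : passes)
    {
        int level = 0;
        for (size_t dep : pass.dependencies)
            level = std::max(level, passes[dep].level + 1);
        pass.level = level;
    }
}

// Passes within the same level are independent; sorting by level gives the
// execution order, and barriers are batched between consecutive levels.
void SortByDependencyLevel(std::vector<size_t>& order,
                           const std::vector<RenderPassNode>& passes)
{
    std::stable_sort(order.begin(), order.end(),
        [&](size_t a, size_t b) { return passes[a].level < passes[b].level; });
}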

Transient Resources
The above is just a simplified view of the graph. In fact, each render pass consists of a number of inputs and outputs. Every input/output is a graphics resource (e.g. a texture), and render passes are connected through such resources within the render graph.
Render graph connecting render passes and resources
As you can see in the above example, there are many transient resources used (e.g. depth buffer, shadow map, etc). We handle such transient resources by using a texture pool. A texture is reused after it is no longer needed by previous passes (placed resources are not used, for simplicity). When building a render graph, we compute the lifetime of every transient resource (i.e. the dependency levels at which the resource starts and stops being used), so we can free transient resources when execution goes beyond a dependency level and reuse them for later render passes. To specify a render pass input/output in my engine, I only need to specify its size/format and don't need to worry about resource creation; the transient resource pool will create the textures as well as the required resource views (e.g. SRV/DSV/RTV).
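A rough sketch of the lifetime bookkeeping (illustrative only; the type and function names are assumptions): each transient resource records the first and last dependency level that touches it, and a pooled texture becomes reusable once execution moves past its last level.

#include <algorithm>
#include <climits>
#include <cstdint>

struct TransientTextureDesc
{
    uint32_t width  = 0;
    uint32_t height = 0;
    uint32_t format = 0;          // e.g. a DXGI_FORMAT value
};

struct TransientResource
{
    TransientTextureDesc desc;
    int firstLevel = INT_MAX;     // first dependency level using this resource
    int lastLevel  = -1;          // last dependency level using this resource
};

// Extend a resource's lifetime to cover a pass at the given dependency level.
void MarkUsage(TransientResource& res, int passLevel)
{
    res.firstLevel = std::min(res.firstLevel, passLevel);
    res.lastLevel  = std::max(res.lastLevel, passLevel);
}

// A pooled texture can be recycled once execution moves past its last level.
bool CanReuse(const TransientResource& res, int currentLevel)
{
    return res.lastLevel < currentLevel;
}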

Conclusion
In this post, I have described how render passes are reordered inside the render graph, when barriers are inserted, and how transient resources are handled. But I have not implemented parallel recording of command lists or async compute. It really takes much more effort to use D3D12 than D3D11. I think the current state of my hobby graphics engine is good enough to use. Looks like I can start learning DXR after spending lots of effort on basic D3D12 set up code. =]

D3D12 Constant Buffer Management

Introduction
Unlike D3D11, D3D12 does not have an explicit constant buffer API object. All we have in D3D12 is ID3D12Resource, which needs to be sub-divided into smaller regions with constant buffer views. And it is our job to handle constant buffer lifetimes and avoid updating constant buffer values while the GPU is still using them. This post will describe how I handle this topic.

Constant buffer pool
We allocate a large ID3D12Resource and treat it as an object pool by sub-dividing it into many small constant buffers (let's call it a constant buffer pool). Since constant buffers are required to be 256-byte aligned (I can only find this requirement in the previous documentation, while the updated documentation only mentions such a requirement under "Uploading Texture Data Through Buffers", which is in a section about textures...), I defined 3 fixed-size pools: 256/512/1024 bytes. These 3 sizes are enough for my needs, as most constant buffers are small (in Seal Guardian, the largest constant buffer size is 560 bytes, while large data like the skinning matrix palette is uploaded via texture).
3 constant buffer pools with different sizes
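For example, picking which pool serves a request could look like this (a sketch; the post does not show the actual selection code):

#include <cassert>
#include <cstddef>

// Constant buffer data must be 256-byte aligned, so every allocation is
// rounded up to the smallest fixed pool size that fits.
size_t SelectConstantBufferPoolSize(size_t requestedBytes)
{
    assert(requestedBytes <= 1024 && "larger data is uploaded via texture");
    if (requestedBytes <= 256)  return 256;
    if (requestedBytes <= 512)  return 512;
    return 1024;
}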
In the last post, a non-shader-visible descriptor heap manager was used to handle non-shader-visible descriptors. But in fact, that is only used for SRV/DSV/RTV descriptors; constant buffer views are managed with another scheme. As described above, when we create an ID3D12Resource for a constant buffer pool, we also create a non-shader-visible ID3D12DescriptorHeap with a size large enough to hold descriptors pointing to all the constant buffers inside the constant buffer pool.
ID3D12Resource and ID3D12DescriptorHeap are created in pairs
We also split the constant buffer pools based on their usage: static/dynamic. So there are 6 constant buffer pools in total inside my toy engine (static 256/512/1024-byte pools + dynamic 256/512/1024-byte pools).

Dynamic constant buffer
Constant buffers can be updated dynamically. Each constant buffer contains a CPU-side copy of its constant values. When a constant buffer is bound before a draw call, those values are copied to the dynamic constant buffer pool (created in an upload heap). A piece of memory for the constant buffer values is allocated from the pool in a ring-buffer fashion. If the pool is full (i.e. the ring buffer wraps around too fast and all the constant buffers are still in use by the GPU), we create a larger pool, and the existing pool is deleted after all related GPU commands finish execution.
Resizing the dynamic constant buffer pool; the previous pool
will be deleted after executing related GPU commands
To avoid copying the same constant buffer values to the pool when a constant buffer is bound multiple times, we keep 2 integer values for every dynamic constant buffer: "last upload frame index" and "value version". The last upload frame index is the frame index at which the CPU constant buffer values were copied to the dynamic pool. The value version is an integer which is monotonically increased every time the constant buffer values get modified/updated. By checking these 2 integers, we can avoid duplicated copies of a constant buffer in the dynamic pool and re-use the previously copied values.
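A sketch of that bookkeeping (my interpretation; the member names are assumptions): the copy into the ring-buffer pool is skipped when the buffer was already uploaded this frame and its values have not changed since that upload.

#include <cstddef>
#include <cstdint>
#include <vector>

struct DynamicConstantBuffer
{
    std::vector<uint8_t> cpuValues;               // CPU-side copy of the constant values
    uint64_t             valueVersion    = 0;     // bumped on every value update
    uint64_t             lastUploadFrame = ~0ull; // frame index of the last copy to the pool
    uint64_t             uploadedVersion = 0;     // value version at the time of that copy
    size_t               poolOffset      = 0;     // where that copy lives inside the dynamic pool
};

// Returns true if a new copy into the ring-buffer pool is needed this frame.
bool NeedsUpload(const DynamicConstantBuffer& cb, uint64_t frameIndex)
{
    return cb.lastUploadFrame != frameIndex || cb.uploadedVersion != cb.valueVersion;
}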

Static constant buffer
A static constant buffer has a static descriptor handle, as described in the last post. The static constant buffer pools are created in the default heap. The pools are managed in a free-list fashion, as opposed to the ring buffer used for the dynamic pools. When a pool is full, we also create an extra pool for new constant buffer allocation requests. But unlike the dynamic pools, the previous pool is not deleted when a new one gets created.
Creating more static constant buffer pools if existing pools are full
To upload static constant buffer values to the GPU (since the static pools are created in the default heap), we use the dynamic constant buffer pool instead of creating another new upload heap. Every frame, we gather all newly created static constant buffers; then, before we start rendering in this frame, we copy all the CPU constant buffer values to the dynamic constant buffer pool and schedule ID3D12GraphicsCommandList::CopyBufferRegion() calls to copy those values from the upload heap to the default heap. By grouping all the static constant buffer uploads, we can reduce the number of D3D12_RESOURCE_BARRIERs needed to transition between the D3D12_RESOURCE_STATE_COPY_DEST and D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER states.
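The batched upload could look roughly like this (a sketch with assumed helper types; the D3D12 calls are the real API, the rest is illustrative). Each distinct static pool is transitioned once, all CopyBufferRegion() calls are issued, and the pools are transitioned back in one batch.

#include <d3d12.h>
#include <utility>
#include <vector>

struct PendingStaticUpload
{
    ID3D12Resource* staticPool;    // default-heap pool holding the constant buffer
    UINT64          staticOffset;  // destination offset inside that pool
    ID3D12Resource* dynamicPool;   // upload-heap pool used as the staging area
    UINT64          dynamicOffset; // where the CPU values were copied this frame
    UINT64          sizeInBytes;
};

void FlushStaticUploads(ID3D12GraphicsCommandList* cmdList,
                        const std::vector<ID3D12Resource*>& touchedStaticPools,
                        const std::vector<PendingStaticUpload>& uploads)
{
    // One transition per distinct static pool, batched into a single call.
    std::vector<D3D12_RESOURCE_BARRIER> barriers;
    for (ID3D12Resource* pool : touchedStaticPools)
    {
        D3D12_RESOURCE_BARRIER barrier = {};
        barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
        barrier.Transition.pResource   = pool;
        barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
        barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER;
        barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_COPY_DEST;
        barriers.push_back(barrier);
    }
    cmdList->ResourceBarrier(static_cast<UINT>(barriers.size()), barriers.data());

    // Copy every newly created static constant buffer from the upload heap.
    for (const PendingStaticUpload& u : uploads)
        cmdList->CopyBufferRegion(u.staticPool, u.staticOffset,
                                  u.dynamicPool, u.dynamicOffset, u.sizeInBytes);

    // Transition the pools back so they can be read as constant buffers again.
    for (D3D12_RESOURCE_BARRIER& b : barriers)
        std::swap(b.Transition.StateBefore, b.Transition.StateAfter);
    cmdList->ResourceBarrier(static_cast<UINT>(barriers.size()), barriers.data());
}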

Conclusion
In this post, I have described how constant buffers are managed in my toy engine. It uses a number of different pool sizes, managed in a ring-buffer fashion for dynamic constant buffers and in a free-list fashion for static constant buffers. Uploads of static constant buffer contents are grouped together to reduce barrier usage. However, I currently only split the usage into static/dynamic. In the future, I would like to investigate whether adding another usage type would help performance for cases where a constant buffer is updated every frame but used frequently in many draw calls (e.g. write once, read many within a frame), and place those resources in the default heap instead of the current dynamic upload heap.

References
[1] https://docs.microsoft.com/en-us/windows/desktop/direct3d12/large-buffers
[2] https://www.gamedev.net/forums/topic/679285-d3d12-how-to-correctly-update-constant-buffers-in-different-scenarios/