遍历着色器中的界层层次 [英] Traversal of Bounding Volume Hierachy in Shaders

查看:80
本文介绍了遍历着色器中的界层层次的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用vulkan计算着色器开发路径跟踪器。我实现了一个表示



图片取自此处。相关的论文和源代码也在Toshiya Hachisuka教授的页面 。在本文引用的本文中也描述了相同的概念。幻灯片






#3 BVH树(具有链接和点击链接)

我必须扩展通过链接推送到着色器的数据。此外,还需要一些离线摆弄才能正确存储树。最初,我尝试使用while循环(循环直到 box_index_next 为-1),这再次导致了疯狂的速度下降。无论如何,以下操作相当快速:

  int box_index_next = 0; 

for(int box_index = 0; box_index< box_count; box_index ++){
if(box_index!= box_index_next){
继续;
}

bool hit = intersect_box(boxes [box_index],ray);
bool leaf = box [box_index] .ids_count> 0;

if(hit){
box_index_next = box [box_index] .links.x; //点击链接
}否则{
box_index_next = box [box_index] .links.y; //如果未命中链接
}

if(命中&叶子){
uint a = box [box_index] .ids_offset;
uint b = a + box [box_index] .ids_count; (uint j = a; j< b; j ++)的

{
uint triangle_id = triangle_references [j];
//三角形交点代码...
}
}
}






此代码比快速但有缺陷的实现#1慢大约3倍。这是可以预料的,现在速度取决于实际的树,而不取决于gpu优化。例如,考虑一个退化的情况,其中三角形沿轴对齐:同一方向的射线可能与所有三角形相交,然后需要评估所有树叶。



教授Toshiya Hachisuka提出了进一步优化此类情况的方法,(第36页及以后):一个存储BVH树的多个版本,并按x,-x,y,-y,z和-z在空间上进行排序。为了遍历,需要根据射线选择正确的版本。然后,只要从叶子上切下一个三角形,就可以停止遍历,因为所有要访问的其余节点都将在空间上位于该节点之后(从射线的角度来看)。






建立BVH树后,找到链接非常简单(下面的一些python代码):

 类NodeAABB(对象):

def __init __(self,obj_bounds,obj_ids):
self.children = [无,无]
self.obj_bounds = obj_bounds
self.obj_ids = obj_ids

def split(self):
#递归拆分并在此处创建子级
提高NotImplementedError()

def is_leaf(self):
return set(self.children)== {无}

def build_links(self,next_right_node = None):
如果不是self.is_leaf():
child1,child2 = self.children

self.hit_node = child1
self.miss_node = next_right_node

child1.build_links(next_right_node = child2)
child2.build_links(next_right_node = next_right_node)

else:
self.hit_node = next_right_node
self.miss_node = self.hit_node

def collect(self):
#以正确的顺序首先深入检索
如果没有self.is_leaf( ):
child1,child2 = self.children
child1.collect()的收益
child2.collect()的收益

将所有AABB存储在一个阵列(将发送到GPU)中之后,可以使用 hit_node miss_node 查找链接的索引并存储它们。


I am working on a path tracer using vulkan compute shaders. I implemented a tree representing a bounding volume hierachy. The idea of the BVH is to minimize the amount of objects a ray intersection test needs to be performed on.


#1 Naive Implementation

My first implementation is very fast, it traverses the tree down to a single leaf of the BVH tree. However, the ray might intersect multiple leaves. This code then leads to some triangles not being rendered (although they should).

int box_index = -1;

for (int i = 0; i < boxes_count; i++) {
    // the first box has no parent, boxes[0].parent is set to -1
    if (boxes[i].parent == box_index) {
        if (intersect_box(boxes[i], ray)) {
            box_index = i;
        }
    }
}

if (box_index > -1) {
    uint a = boxes[box_index].ids_offset;
    uint b = a + boxes[box_index].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}


#2 Multi-Leaf Implementation

My second implementation accounts for the fact that multiple leaves might be intersected. However, this implementation is 36x slower than implementation #1 (okay, I miss some intersection tests in #1, but still...).

bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);

for (int i = 1; i < boxes_count; i++) {
    if (hits[boxes[i].parent]) {
        hits[i] = intersect_box(boxes[i], ray);
    } else {
        hits[i] = false;
    }
}

for (int i = 0; i < boxes_count; i++) {
    if (!hits[i]) {
        continue;
    }

    // only leaves have ids_offset and ids_count defined (not set to -1)
    if (boxes[i].ids_offset < 0) {
        continue;
    }

    uint a = boxes[i].ids_offset;
    uint b = a + boxes[i].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}


This performance difference drives me crazy. It seems only having a single statement like if(dynamically_modified_array[some_index]) has a huge impact on performance. I suspect that the SPIR-V or GPU compiler is no longer able to do its optimization magic? So here are my questions:

  1. Is this indeed an optimization problem?

  2. If yes, can I transform implementation #2 to be better optimizable? Can I somehow give optimization hints?

  3. Is there a standard way to implement BVH tree queries in shaders?

解决方案

After some digging, I found a solution. Important to understand is that the BVH tree does not exclude the possibility that one needs to evaluate all leaves.

Implementation #3 below, uses hit and miss links. The boxes need to be sorted in a way that in the worst case all of them are queried in the correct order (so a single loop is enough). However, links are used to skip nodes which don't need to be evaluated. When the current node is a leaf node, the actual triangle intersections are performed.

  • hit link ~ which node to jump to in case of a hit (green below)
  • miss link ~ which node to jump to in case of a miss (red below)

Image taken from here. The associated paper and source code is also on Prof. Toshiya Hachisuka's page. The same concept is also described in this paper referenced in the slides.


#3 BVH Tree with Hit and Miss Links

I had to extend the data which is pushed to the shader with the links. Also some offline fiddling was required to store the tree correctly. At first I tried using a while loop (loop until box_index_next is -1) which resulted in a crazy slowdown again. Anyway, the following works reasonably fast:

int box_index_next = 0;

for (int box_index = 0; box_index < boxes_count; box_index++) {
    if (box_index != box_index_next) {
        continue;
    }

    bool hit = intersect_box(boxes[box_index], ray);
    bool leaf = boxes[box_index].ids_count > 0;

    if (hit) {
        box_index_next = boxes[box_index].links.x; // hit link
    } else {
        box_index_next = boxes[box_index].links.y; // miss link
    }

    if (hit && leaf) {
        uint a = boxes[box_index].ids_offset;
        uint b = a + boxes[box_index].ids_count;

        for (uint j = a; j < b; j++) {
            uint triangle_id = triangle_references[j];
            // triangle intersection code ...
        }
    }
}


This code is about 3x slower than the fast, but flawed implementation #1. This is somewhat expected, now the speed depends on the actual tree, not on the gpu optimization. Consider, for example, a degenerate case where triangles are aligned along an axis: a ray in the same direction might intersect with all triangles, then all tree leaves need to be evaluated.

Prof. Toshiya Hachisuka proposes a further optimization for such cases in his sildes (page 36 and onward): One stores multiple versions of the BVH tree, spatially sorted along x, -x, y, -y, z and -z. For traversal the correct version needs to be selected based on the ray. Then one can stop the traversal as soon as a triangle from a leaf is intersected, since all remaining nodes to be visited will be spatially behind this node (from the ray point of view).


Once the BVH tree is built, finding the links is quite straightforward (some python code below):

class NodeAABB(object):

    def __init__(self, obj_bounds, obj_ids):
        self.children = [None, None]
        self.obj_bounds = obj_bounds
        self.obj_ids = obj_ids

    def split(self):
        # split recursively and create children here
        raise NotImplementedError()

    def is_leaf(self):
        return set(self.children) == {None}

    def build_links(self, next_right_node=None):
        if not self.is_leaf():
            child1, child2 = self.children

            self.hit_node = child1
            self.miss_node = next_right_node

            child1.build_links(next_right_node=child2)
            child2.build_links(next_right_node=next_right_node)

        else:
            self.hit_node = next_right_node
            self.miss_node = self.hit_node

    def collect(self):
        # retrieve in depth first fashion for correct order
        yield self
        if not self.is_leaf():
            child1, child2 = self.children
            yield from child1.collect()
            yield from child2.collect()

After you store all AABBs in an array (which will be sent to the GPU) you can use hit_node and miss_node to look up the indices for the links and store them as well.

这篇关于遍历着色器中的界层层次的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆