Efficient (and well explained) implementation of a Quadtree for 2D collision detection


Problem Description


I've been working on adding a Quadtree to a program that I'm writing, and I can't help but notice that there are few well explained/performing tutorials for the implementation that I'm looking for.

Specifically, a list of the methods and pseudocode for how to implement them (or just a description of their processes) that are commonly used in a Quadtree (retrieve, insert, remove, etc.) is what I'm looking for, along with maybe some tips to improve performance. This is for collision detection, so it'd be best to be explained with 2d rectangles in mind, as they are the objects that will be stored.

Solution

  1. Efficient Quadtrees


All right, I'll take a shot at this. First a teaser to show the results of what I'll propose involving 20,000 agents (just something I whipped up real quick for this specific question):

The GIF has extremely reduced frame rate and significantly lower res to fit the 2 MB maximum for this site. Here's a video if you want to see the thing at close to full speed: https://streamable.com/3pgmn.

And here's a GIF with 100k agents, though I had to fiddle with it quite a bit to get it to fit in 2 megabytes: I had to turn off the quadtree lines (they didn't seem to want to compress as much with them on) and change the way the agents looked (I wish making a GIF was as easy as coding a quadtree):

The simulation with 20k agents takes ~3 megabytes of RAM. I can also easily handle 100k smaller agents without sacrificing frame rates, though it leads to a bit of a mess on the screen to the point where you can barely see what's going on as in the GIF above. This is all running in just one thread on my i7 and I'm spending almost half the time according to VTune just drawing this stuff on the screen (just using some basic scalar instructions to plot things one pixel at a time in CPU).

And here's a video with 100,000 agents though it's hard to see what's going on. It's kind of a big video and I couldn't find any decent way to compress it without the whole video turning into a mush (might need to download or cache it first to see it stream at reasonable FPS). With 100k agents the simulation takes around 4.5 megabytes of RAM and the memory use is very stable after running the simulation for about 5 seconds (stops going up or down since it ceases to heap allocate). Same thing in slow motion.

Efficient Quadtree for Collision Detection

All right, so actually quadtrees are not my favorite data structure for this purpose. I tend to prefer a grid hierarchy, like a coarse grid for the world, a finer grid for a region, and an even finer grid for a sub-region (3 fixed levels of dense grids, and no trees involved), with row-based optimizations so that a row that has no entities in it will be deallocated and turned into a null pointer, and likewise completely empty regions or sub-regions turned into nulls. While this simple implementation of the quadtree running in one thread can handle 100k agents on my i7 at 60+ FPS, I've implemented grids that can handle a couple million agents bouncing off each other every frame on older hardware (an i3). Also I always liked how grids made it very easy to predict how much memory they'll require, since they don't subdivide cells. But I'll try to cover how to implement a reasonably efficient quadtree.

Note that I won't go into the full theory of the data structure. I'm assuming that you already know that and are interested in improving performance. I'm also just going into my personal way of tackling this problem which seems to outperform most of the solutions I find online for my cases, but there are many decent ways and these solutions are tailor-fitted to my use cases (very large inputs with everything moving every frame for visual FX in films and television). Other people probably optimize for different use cases. When it comes to spatial indexing structures in particular, I really think the efficiency of the solution tells you more about the implementer than the data structure. Also the same strategies I'll propose to speeding things up also apply in 3 dimensions with octrees.

Node Representation

So first of all, let's cover the node representation:

// Represents a node in the quadtree.
struct QuadNode
{
    // Points to the first child if this node is a branch or the first
    // element if this node is a leaf.
    int32_t first_child;

    // Stores the number of elements in the leaf or -1 if this node is
    // not a leaf.
    int32_t count;
};

It's a total of 8 bytes, and this is very important as it's a key part of the speed. I actually use a smaller one (6 bytes per node) but I'll leave that as an exercise to the reader.

You can probably do without the count. I include that for pathological cases to avoid linearly traversing the elements and counting them each time a leaf node might split. In most common cases a node shouldn't store that many elements. However, I work in visual FX and the pathological cases aren't necessarily rare. You can encounter artists creating content with a boatload of coincident points, massive polygons that span the entire scene, etc, so I end up storing a count.

Where are the AABBs?

So one of the first things people might be wondering is where the bounding boxes (rectangles) are for the nodes. I don't store them. I compute them on the fly. I'm kinda surprised most people don't do that in the code I've seen. For me, they're only stored with the tree structure (basically just one AABB for the root).

That might seem like it'd be more expensive to be computing these on the fly, but reducing the memory use of a node can proportionally reduce cache misses when you traverse the tree, and those cache miss reductions tend to be more significant than having to do a couple of bitshifts and some additions/subtractions during traversal. Traversal looks like this:

static QuadNodeList find_leaves(const Quadtree& tree, const QuadNodeData& root, const int rect[4])
{
    QuadNodeList leaves, to_process;
    to_process.push_back(root);
    while (to_process.size() > 0)
    {
        const QuadNodeData nd = to_process.pop_back();

        // If this node is a leaf, insert it to the list.
        if (tree.nodes[nd.index].count != -1)
            leaves.push_back(nd);
        else
        {
            // Otherwise push the children that intersect the rectangle.
            // nd.crect = {center x, center y, half width, half height}, so
            // hx and hy below are the quarter sizes: they offset the child
            // centers and also serve as the children's half sizes.
            const int mx = nd.crect[0], my = nd.crect[1];
            const int hx = nd.crect[2] >> 1, hy = nd.crect[3] >> 1;
            const int fc = tree.nodes[nd.index].first_child;
            const int l = mx-hx, t = my-hy, r = mx+hx, b = my+hy;

            if (rect[1] <= my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,t, hx, hy, fc+0, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,t, hx, hy, fc+1, nd.depth+1));
            }
            if (rect[3] > my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,b, hx, hy, fc+2, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,b, hx, hy, fc+3, nd.depth+1));
            }
        }
    }
    return leaves;
}

Omitting the AABBs is one of the most unusual things I do (I keep looking for other people doing it just to find a peer and fail), but I've measured the before and after and it did reduce times considerably, at least on very large inputs, to compact the quadtree node substantially and just compute AABBs on the fly during traversal. Space and time aren't always diametrically opposed. Sometimes reducing space also means reducing time given how much performance is dominated by the memory hierarchy these days. I've even sped up some real world operations applied on massive inputs by compressing the data to quarter the memory use and decompressing on the fly.

I don't know why many people choose to cache the AABBs: whether it's programming convenience or if it's genuinely faster in their cases. Yet for data structures which split evenly down the center like regular quadtrees and octrees, I'd suggest measuring the impact of omitting the AABBs and computing them on the fly. You might be quite surprised. Of course it makes sense to store AABBs for structures that don't split evenly like Kd-trees and BVHs as well as loose quadtrees.

Floating-Point

I don't use floating-point for spatial indexes and maybe that's why I see improved performance just computing AABBs on the fly with right shifts for division by 2 and so forth. That said, at least SPFP (single-precision floating-point) seems really fast nowadays. I don't know since I haven't measured the difference. I just use integers by preference even though I'm generally working with floating-point inputs (mesh vertices, particles, etc). I just end up converting them to integer coordinates for the purpose of partitioning and performing spatial queries. I'm not sure if there's any major speed benefit of doing this anymore. It's just a habit and preference since I find it easier to reason about integers without having to think about denormalized FP and all that.
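
For what it's worth, here's one minimal way that float-to-integer conversion might look: scale the world coordinate by a fixed factor and floor it. The factor of 16 is an arbitrary choice for illustration, not something prescribed above.

#include <cmath>

// Converts a world-space float coordinate to the integer coordinate space
// used for partitioning. The scale (16 here) decides how much sub-unit
// precision survives the conversion; pick it to suit your world units.
inline int to_integer_coord(float world_coord)
{
    const float scale = 16.0f;
    return static_cast<int>(std::floor(world_coord * scale));
}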

Centered AABBs

While I only store a bounding box for the root, it helps to use a representation that stores a center and half size for nodes while using a left/top/right/bottom representation for queries to minimize the amount of arithmetic involved.
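
The find_leaves snippet above relies on a QuadNodeData value and a child_data helper that aren't shown anywhere. For concreteness, here's a minimal sketch of what they might look like with this centered representation; the exact field names and layout are an assumption on my part rather than the original code.

// Carries a node index plus the data computed on the fly during traversal:
// the node's rectangle in centered form and its depth.
struct QuadNodeData
{
    // crect[0..3] = {center x, center y, half width, half height}.
    int crect[4];

    // Index of the node in Quadtree::nodes.
    int index;

    // Depth of the node (0 for the root).
    int depth;
};

// Stack/result list used by find_leaves (SmallList is described further down).
typedef SmallList<QuadNodeData> QuadNodeList;

// Packs a child's centered rect, node index, and depth into a QuadNodeData.
static QuadNodeData child_data(int mx, int my, int hx, int hy, int index, int depth)
{
    QuadNodeData nd;
    nd.crect[0] = mx;
    nd.crect[1] = my;
    nd.crect[2] = hx;
    nd.crect[3] = hy;
    nd.index = index;
    nd.depth = depth;
    return nd;
}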

Contiguous Children

This is likewise key, and if we refer back to the node rep:

struct QuadNode
{
    int32_t first_child;
    ...
};

We don't need to store an array of children because all 4 children are contiguous:

first_child+0 = index to 1st child (TL)
first_child+1 = index to 2nd child (TR)
first_child+2 = index to 3rd child (BL)
first_child+3 = index to 4th child (BR)

This not only significantly reduces cache misses on traversal but also allows us to significantly shrink our nodes which further reduces cache misses, storing only one 32-bit index (4 bytes) instead of an array of 4 (16 bytes).

This does mean that if you need to transfer elements to just a couple of quadrants of a parent when it splits, it must still allocate all 4 child leaves to store elements in just two quadrants while having two empty leaves as children. However, the trade-off is more than worth it performance-wise at least in my use cases, and remember that a node only takes 8 bytes given how much we've compacted it.

When deallocating children, we deallocate all four at a time. I do this in constant-time using an indexed free list (the same idea as the FreeList<T> shown further down), except we're pooling memory chunks containing 4 contiguous nodes instead of one at a time. This makes it so we usually don't need to involve any heap allocations or deallocations during the simulation. A group of 4 nodes gets marked as freed indivisibly only to then be reclaimed indivisibly in a subsequent split of another leaf node.
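
For illustration, the allocation side of that scheme might look like the sketch below: when a leaf splits, it either reclaims a previously freed group of 4 contiguous nodes from free_node or appends 4 fresh nodes to the back of the array. The helper name is hypothetical (it would need to be declared on the Quadtree struct shown later), but the free-list linkage matches the cleanup() code in the next section.

// Allocates 4 contiguous child nodes and returns the index of the first.
// The free list links groups of 4 through the first node's first_child
// field, mirroring how cleanup() releases them.
int Quadtree::allocate_four_nodes()
{
    int first_index;
    if (free_node != -1)
    {
        // Reclaim a previously freed group of 4 contiguous nodes.
        first_index = free_node;
        free_node = nodes[first_index].first_child;
    }
    else
    {
        // Otherwise append 4 fresh nodes to the back of the array.
        first_index = static_cast<int>(nodes.size());
        nodes.resize(nodes.size() + 4);
    }

    // Initialize all 4 children as empty leaves.
    for (int j = 0; j < 4; ++j)
    {
        nodes[first_index + j].first_child = -1;
        nodes[first_index + j].count = 0;
    }
    return first_index;
}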

Deferred Cleanup

I don't update the quadtree's structure right away upon removing elements. When I remove an element, I just descend down the tree to the child node(s) it occupies and then remove the element, but I don't bother to do anything more just yet even if the leaves become empty.

Instead I do a deferred cleanup like this:

void Quadtree::cleanup()
{
    // Only process the root if it's not a leaf.
    SmallList<int> to_process;
    if (nodes[0].count == -1)
        to_process.push_back(0);

    while (to_process.size() > 0)
    {
        const int node_index = to_process.pop_back();
        QuadNode& node = nodes[node_index];

        // Loop through the children.
        int num_empty_leaves = 0;
        for (int j=0; j < 4; ++j)
        {
            const int child_index = node.first_child + j;
            const QuadNode& child = nodes[child_index];

            // Increment empty leaf count if the child is an empty 
            // leaf. Otherwise if the child is a branch, add it to
            // the stack to be processed in the next iteration.
            if (child.count == 0)
                ++num_empty_leaves;
            else if (child.count == -1)
                to_process.push_back(child_index);
        }

        // If all the children were empty leaves, remove them and 
        // make this node the new empty leaf.
        if (num_empty_leaves == 4)
        {
            // Push all 4 children to the free list.
            nodes[node.first_child].first_child = free_node;
            free_node = node.first_child;

            // Make this node the new empty leaf.
            node.first_child = -1;
            node.count = 0;
        }
    }
}

This is called at the end of every single frame after moving all the agents. The reason I do this kind of deferred removal of empty leaf nodes in multiple iterations and not all at once in the process of removing a single element is that element A might move to node N2, making N1 empty. However, element B might, in the same frame, move to N1, making it occupied again.

With the deferred cleanup, we can handle such cases without unnecessarily removing children only to add them right back when another element moves into that quadrant.

Moving elements in my case is straightforward: 1) remove element, 2) move it, 3) reinsert it to the quadtree. After we move all the elements and at the end of a frame (not time step, there could be multiple time steps per frame), the cleanup function above is called to remove the children from a parent which has 4 empty leaves as children, which effectively turns that parent into the new empty leaf which might then be cleaned up in the next frame with a subsequent cleanup call (or not if things get inserted to it or if the empty leaf's siblings are non-empty).
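
In code, the per-frame flow might look roughly like this. The Agent struct and the Quadtree::remove/insert signatures are assumptions purely to illustrate the order of operations; they aren't shown elsewhere in this answer.

#include <vector>

// Hypothetical agent: the quadtree element index it was inserted as, plus
// its current rectangle.
struct Agent
{
    int element;
    int x1, y1, x2, y2;
};

void move_agent(Agent& a);  // game-specific movement, not shown

void update_frame(Quadtree& tree, std::vector<Agent>& agents)
{
    for (std::size_t i = 0; i < agents.size(); ++i)
    {
        Agent& a = agents[i];
        tree.remove(a.element);                           // 1) remove from the tree
        move_agent(a);                                    // 2) move it
        a.element = tree.insert(a.x1, a.y1, a.x2, a.y2);  // 3) reinsert with the new rect
    }
    // Deferred cleanup once per frame, after all the moves.
    tree.cleanup();
}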

Let's look at the deferred cleanup visually:

Starting with this, let's say we remove some elements from the tree leaving us with 4 empty leaves:

At this point, if we call cleanup, it will remove 4 leaves if it finds 4 empty child leaves and turn the parent into an empty leaf, like so:

Let's say we remove some more elements:

... and then call cleanup again:

If we call it yet again, we end up with this:

... at which point the root itself turns into an empty leaf. However, the cleanup method never removes the root (it only removes children). Again the main point of doing it deferred this way and in multiple steps is to reduce the amount of potential redundant work that could occur per time step (which can be a lot) if we did this all immediately every single time an element is removed from the tree. It also helps to distribute that work across frames to avoid stutters.

TBH, I originally applied this "deferred cleanup" technique in a DOS game I wrote in C out of sheer laziness! I didn't want to bother with descending down the tree, removing elements, and then removing nodes in a bottom-up fashion back then because I originally wrote the tree to favor top-down traversal (not top-down and back up again) and really thought this lazy solution was a productivity compromise (sacrificing optimal performance to get implemented faster). However, many years later, I actually got around to implementing quadtree removal in ways that immediately started removing nodes and, to my surprise, I actually significantly made it slower with more unpredictable and stuttery frame rates. The deferred cleanup, in spite of originally being inspired by my programmer laziness, was actually (and accidentally) a very effective optimization for dynamic scenes.

Singly-Linked Index Lists for Elements

For elements, I use this representation:

// Represents an element in the quadtree.
struct QuadElt
{
    // Stores the ID for the element (can be used to
    // refer to external data).
    int id;

    // Stores the rectangle for the element.
    int x1, y1, x2, y2;
};

// Represents an element node in the quadtree.
struct QuadEltNode
{
    // Points to the next element in the leaf node. A value of -1 
    // indicates the end of the list.
    int next;

    // Stores the element index.
    int element;
};

I use an "element node" which is separate from "element". An element is only inserted once to the quadtree no matter how many cells it occupies. However, for each cell it occupies, an "element node" is inserted which indexes that element.

The element node is a singly-linked index list node into an array, and it also uses the free list method above. This incurs some more cache misses over storing all the elements contiguously for a leaf. However, given that this quadtree is for very dynamic data which is moving and colliding every single time step, it generally takes more processing time than it saves to seek out a perfectly contiguous representation for the leaf elements (you would effectively have to implement a variable-sized memory allocator which is really fast, and that's far from an easy thing to do). So I use the singly-linked index list which allows a free list constant-time approach to allocation/deallocation. When you use that representation, you can transfer elements from split parents to new leaves by just changing a few integers.
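
As a concrete sketch, inserting an element into a leaf with this representation is just a prepend to the leaf's list. The helper name is hypothetical (it would need to be declared on the Quadtree struct shown further down), but it only touches the integers described above.

// Prepends an element node referencing 'element' to the leaf's
// singly-linked list. For a leaf, first_child holds the index of the
// first element node (-1 if the leaf is empty).
void Quadtree::leaf_insert_element(int leaf_index, int element)
{
    QuadEltNode elt_node;
    elt_node.element = element;
    elt_node.next = nodes[leaf_index].first_child;

    // Allocate the element node from the free list and make it the new head.
    nodes[leaf_index].first_child = elt_nodes.insert(elt_node);
    ++nodes[leaf_index].count;
}

Transferring elements when a parent splits works the same way: pop element nodes off the parent's list and prepend them to the appropriate child leaves, which is just a handful of integer assignments.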

SmallList<T>

Oh, I should mention this. Naturally it helps if you don't heap allocate just to store a temporary stack of nodes for non-recursive traversal. SmallList<T> is similar to vector<T> except it won't involve a heap allocation until you insert more than 128 elements to it. It's similar to SBO string optimizations in the C++ standard lib. It's something I implemented and have been using for ages and it does help a lot to make sure you use the stack whenever possible.
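
Here's a bare-bones sketch of the idea (not the actual implementation I use): a fixed buffer of 128 elements lives inside the object (typically on the stack), and the heap is only touched once that capacity is exceeded. It assumes T is trivially copyable and omits the copy/move semantics a real version needs (find_leaves above returns one by value).

#include <cstdlib>
#include <cstring>

template <class T>
class SmallList
{
public:
    SmallList(): ld(fixed), num(0), cap(128) {}
    ~SmallList() { if (ld != fixed) std::free(ld); }

    void push_back(const T& element)
    {
        if (num >= cap)
        {
            // Fixed buffer (or previous heap block) is full: grow on the heap.
            cap *= 2;
            T* new_mem = static_cast<T*>(std::malloc(cap * sizeof(T)));
            std::memcpy(new_mem, ld, num * sizeof(T));
            if (ld != fixed)
                std::free(ld);
            ld = new_mem;
        }
        ld[num++] = element;
    }

    T pop_back()     { return ld[--num]; }
    int size() const { return num; }

    T& operator[](int n)             { return ld[n]; }
    const T& operator[](int n) const { return ld[n]; }

private:
    T fixed[128];  // in-object storage used until it overflows
    T* ld;         // points to 'fixed' or to heap memory
    int num, cap;
};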

Tree Representation

Here's the representation of the quadtree itself:

struct Quadtree
{
    // Stores all the elements in the quadtree.
    FreeList<QuadElt> elts;

    // Stores all the element nodes in the quadtree.
    FreeList<QuadEltNode> elt_nodes;

    // Stores all the nodes in the quadtree. The first node in this
    // sequence is always the root.
    std::vector<QuadNode> nodes;

    // Stores the quadtree extents.
    QuadCRect root_rect;

    // Stores the first free node in the quadtree to be reclaimed as 4
    // contiguous nodes at once. A value of -1 indicates that the free
    // list is empty, at which point we simply insert 4 nodes to the
    // back of the nodes array.
    int free_node;

    // Stores the maximum depth allowed for the quadtree.
    int max_depth;
};

As pointed out above, we store a single rectangle for the root (root_rect). All sub-rects are computed on the fly. All nodes are stored contiguously in an array (std::vector<QuadNode>) along with the elements and element nodes (in FreeList<T>).

FreeList<T>

I use a FreeList data structure which is basically an array (and random-access sequence) that lets you remove elements from anywhere in constant-time (leaving holes behind which get reclaimed upon subsequent insertions in constant-time). Here's a simplified version which doesn't bother with handling non-trivial data types (doesn't use placement new or manual destruction calls):

/// Provides an indexed free list with constant-time removals from anywhere
/// in the list without invalidating indices. T must be trivially constructible 
/// and destructible.
template <class T>
class FreeList
{
public:
    /// Creates a new free list.
    FreeList();

    /// Inserts an element to the free list and returns an index to it.
    int insert(const T& element);

    // Removes the nth element from the free list.
    void erase(int n);

    // Removes all elements from the free list.
    void clear();

    // Returns the range of valid indices.
    int range() const;

    // Returns the nth element.
    T& operator[](int n);

    // Returns the nth element.
    const T& operator[](int n) const;

private:
    union FreeElement
    {
        T element;
        int next;
    };
    std::vector<FreeElement> data;
    int first_free;
};

template <class T>
FreeList<T>::FreeList(): first_free(-1)
{
}

template <class T>
int FreeList<T>::insert(const T& element)
{
    if (first_free != -1)
    {
        const int index = first_free;
        first_free = data[first_free].next;
        data[index].element = element;
        return index;
    }
    else
    {
        FreeElement fe;
        fe.element = element;
        data.push_back(fe);
        return static_cast<int>(data.size() - 1);
    }
}

template <class T>
void FreeList<T>::erase(int n)
{
    data[n].next = first_free;
    first_free = n;
}

template <class T>
void FreeList<T>::clear()
{
    data.clear();
    first_free = -1;
}

template <class T>
int FreeList<T>::range() const
{
    return static_cast<int>(data.size());
}

template <class T>
T& FreeList<T>::operator[](int n)
{
    return data[n].element;
}

template <class T>
const T& FreeList<T>::operator[](int n) const
{
    return data[n].element;
}

I have one which does work with non-trivial types and provides iterators and so forth but is much more involved. These days I tend to work more with trivially constructible/destructible C-style structs anyway (only using non-trivial user-defined types for high-level stuff).

Maximum Tree Depth

I prevent the tree from subdividing too much by specifying a max depth allowed. For the quick simulation I whipped up, I used 8. For me this is crucial since again, in VFX I encounter pathological cases a lot including content created by artists with lots of coincident or overlapping elements which, without a maximum tree depth limit, could want it to subdivide indefinitely.

There is a bit of fine-tuning if you want optimal performance with respect to max depth allowed and how many elements you allow to be stored in a leaf before it splits into 4 children. I tend to find the optimal results are gained with something around 8 elements max per node before it splits, and a max depth set so that the smallest cell size is a little over the size of your average agent (otherwise more single agents could be inserted into multiple leaves).
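
The split test implied by that tuning might look like the small sketch below. The helper is an illustration only; 8 elements per leaf matches the numbers mentioned above, and the depth comes from whatever traversal data you carry (QuadNodeData::depth in the earlier sketch).

// Returns true if a leaf should split: it's full (8 elements here) and
// hasn't reached the tree's maximum depth.
static bool should_split(const Quadtree& tree, const QuadNode& node, int depth)
{
    const int max_elements_per_leaf = 8;
    return node.count >= max_elements_per_leaf && depth < tree.max_depth;
}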

Collision and Queries

There are a couple of ways to do the collision detection and queries. I often see people do it like this:

for each element in scene:
     use quad tree to check for collision against other elements

This is very straightforward but the problem with this approach is that the first element in the scene might be in a totally different location in the world from the second. As a result, the paths we take down the quadtree could be totally sporadic. We might traverse one path to a leaf for the first element and then not need that same path again until, say, the 50,000th element. By this time the nodes involved in that path may have already been evicted from the CPU cache. So I recommend doing it this way:

traversed = {}
gather quadtree leaves
for each leaf in leaves:
{
     for each element in leaf:
     {
          if not traversed[element]:
          {
              use quad tree to check for collision against other elements
              traversed[element] = true                  
          }
     }
}

While that's quite a bit more code and requires we keep a traversed bitset or parallel array of some sort to avoid processing elements twice (since they may be inserted in more than one leaf), it helps make sure that we descend the same paths down the quadtree throughout the loop. That helps keep things much more cache-friendly. Also if after attempting to move the element in the time step, it's still encompassed entirely in that leaf node, we don't even need to work our way back up again from the root (we can just check that one leaf only).
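
Here's a rough C++ sketch of that leaf-by-leaf loop, reusing the structures from the earlier sections. check_collisions_for stands in for the per-element quadtree query plus the narrow-phase test, and root/root_rect_ltrb are assumed to describe the root node; those parts are illustration-only assumptions.

#include <vector>

// Narrow-phase handling for one element (query the tree with its rect and
// test against what comes back) -- definition not shown.
void check_collisions_for(Quadtree& tree, int element);

void process_collisions(Quadtree& tree, const QuadNodeData& root, const int root_rect_ltrb[4])
{
    // One flag per element so elements inserted into several leaves are
    // only processed once.
    std::vector<bool> traversed(tree.elts.range(), false);

    // Gather every leaf by querying with the root's rectangle.
    QuadNodeList leaves = find_leaves(tree, root, root_rect_ltrb);

    for (int i = 0; i < leaves.size(); ++i)
    {
        const QuadNodeData& leaf = leaves[i];

        // Walk the leaf's singly-linked list of element nodes.
        for (int en = tree.nodes[leaf.index].first_child; en != -1; en = tree.elt_nodes[en].next)
        {
            const int element = tree.elt_nodes[en].element;
            if (!traversed[element])
            {
                check_collisions_for(tree, element);
                traversed[element] = true;
            }
        }
    }
}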

Common Inefficiencies: Things to Avoid

While there are many ways to skin the cat and achieve an efficient solution, there is a common way to achieve a very inefficient solution. And I've encountered my share of very inefficient quadtrees, kd trees, and octrees in my career working in VFX. We're talking over a gigabyte of memory use just to partition a mesh with 100k triangles while taking 30 secs to build, when a decent implementation should be able to do the same hundreds of times a second and just take a few megs. There are many people whipping these up to solve problems who are theoretical wizards but didn't pay much attention to memory efficiency.

So the absolute most common no-no I see is to store one or more full-blown containers with each tree node. By full-blown container, I mean something that owns and allocates and frees its own memory, like this:

struct Node
{
     ...

     // Stores the elements in the node.
     List<Element> elements;
};

And List<Element> might be a list in Python, an ArrayList in Java or C#, std::vector in C++, etc: some data structure that manages its own memory/resources.

The problem here is that while such containers are very efficiently implemented for storing a large number of elements, all of them in all languages are extremely inefficient if you instantiate a boatload of them only to store a few elements in each one. One of the reasons is that the container metadata tends to be quite explosive in memory usage at such a granular level of a single tree node. A container might need to store a size, capacity, a pointer/reference to data it allocates, etc, and all for a generalized purpose so it might use 64-bit integers for size and capacity. As a result the metadata just for an empty container might be 24 bytes which is already 3 times larger than the entirety of the node representation I proposed, and that's just for an empty container designed to store elements in leaves.

Furthermore each container often wants to either heap/GC-allocate on insertion or require even more preallocated memory in advance (at which point it might take 64 bytes just for the container itself). So that either becomes slow because of all the allocations (note that GC allocations are really fast initially in some implementations like JVM, but that's only for the initial burst Eden cycle) or because we're incurring a boatload of cache misses because the nodes are so huge that barely any fit into the lower levels of the CPU cache on traversal, or both.

Yet this is a very natural inclination and makes intuitive sense since we talk about these structures theoretically using language like, "Elements are stored in the leaves" which suggests storing a container of elements in leaf nodes. Unfortunately it has an explosive cost in terms of memory use and processing. So avoid this if the desire is to create something reasonably efficient. Make the Node share and point to (refer to) or index memory allocated and stored for the entire tree, not for every single individual node. In actuality the elements shouldn't be stored in the leaves.

Elements should be stored in the tree and leaf nodes should index or point to those elements.

Conclusion

Phew, so those are the main things I do to achieve what is hopefully considered a decent-performing solution. I hope that helps. Note that I am aiming this at a somewhat advanced level for people who have already implemented quadtrees at least once or twice. If you have any questions, feel free to shoot.

Since this question is a bit broad, I might come and edit it and keep tweaking and expanding it over time if it doesn't get closed (I love these types of questions since they give us an excuse to write about our experiences working in the field, but the site doesn't always like them). I'm also hoping some experts might jump in with alternative solutions I can learn from and perhaps use to improve mine further.

Again quadtrees aren't actually my favorite data structure for extremely dynamic collision scenarios like this. So I probably have a thing or two to learn from people who do favor quadtrees for this purpose and have been tweaking and tuning them for years. Mostly I use quadtrees for static data that doesn't move around every frame, and for those I use a very different representation from the one proposed above.
