Efficient (and well explained) implementation of a Quadtree for 2D collision detection


Problem Description

I've been working on adding a Quadtree to a program that I'm writing, and I can't help but notice that there are few well explained/performing tutorials for the implementation that I'm looking for.

Specifically, a list of the methods and pseudocode for how to implement them (or just a description of their processes) that are commonly used in a Quadtree (retrieve, insert, remove, etc.) is what I'm looking for, along with maybe some tips to improve performance. This is for collision detection, so it'd be best to be explained with 2d rectangles in mind, as they are the objects that will be stored.

Solution

1. Efficient Quadtrees

All right, I'll take a shot at this. First a teaser to show the results of what I'll propose involving 20,000 agents (just something I whipped up real quick for this specific question):

The GIF has extremely reduced frame rate and significantly lower res to fit the 2 MB maximum for this site. Here's a video if you want to see the thing at close to full speed: https://streamable.com/3pgmn.

And a GIF with 100k though I had to fiddle with it so much and had to turn off the quadtree lines (didn't seem to want to compress as much with them on) as well as change the way the agents looked to get it to fit in 2 megabytes (I wish making a GIF was as easy as coding a quadtree):

The simulation with 20k agents takes ~3 megabytes of RAM. I can also easily handle 100k smaller agents without sacrificing frame rates, though it leads to a bit of a mess on the screen to the point where you can barely see what's going on as in the GIF above. This is all running in just one thread on my i7 and I'm spending almost half the time according to VTune just drawing this stuff on the screen (just using some basic scalar instructions to plot things one pixel at a time on the CPU).

And here's a video with 100,000 agents though it's hard to see what's going on. It's kind of a big video and I couldn't find any decent way to compress it without the whole video turning into a mush (might need to download or cache it first to see it stream at reasonable FPS). With 100k agents the simulation takes around 4.5 megabytes of RAM and the memory use is very stable after running the simulation for about 5 seconds (stops going up or down since it ceases to heap allocate). Same thing in slow motion.

Efficient Quadtree for Collision Detection

All right, so actually quadtrees are not my favorite data structure for this purpose. I tend to prefer a grid hierarchy, like a coarse grid for the world, a finer grid for a region, and an even finer grid for a sub-region (3 fixed levels of dense grids, and no trees involved), with row-based optimizations so that a row that has no entities in it will be deallocated and turned into a null pointer, and likewise completely empty regions or sub-regions turned into nulls. While this simple implementation of the quadtree running in one thread can handle 100k agents on my i7 at 60+ FPS, I've implemented grids that can handle a couple million agents bouncing off each other every frame on older hardware (an i3). Also I always liked how grids made it very easy to predict how much memory they'll require, since they don't subdivide cells. But I'll try to cover how to implement a reasonably efficient quadtree.

Note that I won't go into the full theory of the data structure. I'm assuming that you already know that and are interested in improving performance. I'm also just going into my personal way of tackling this problem which seems to outperform most of the solutions I find online for my cases, but there are many decent ways and these solutions are tailor-fitted to my use cases (very large inputs with everything moving every frame for visual FX in films and television). Other people probably optimize for different use cases. When it comes to spatial indexing structures in particular, I really think the efficiency of the solution tells you more about the implementer than the data structure. Also the same strategies I'll propose to speeding things up also apply in 3 dimensions with octrees.

Node Representation

So first of all, let's cover the node representation:

// Represents a node in the quadtree.
struct QuadNode
{
    // Points to the first child if this node is a branch or the first
    // element if this node is a leaf.
    int32_t first_child;

    // Stores the number of elements in the leaf or -1 if this node is
    // not a leaf.
    int32_t count;
};

It's a total of 8 bytes, and this is very important as it's a key part of the speed. I actually use a smaller one (6 bytes per node) but I'll leave that as an exercise to the reader.

You can probably do without the count. I include that for pathological cases to avoid linearly traversing the elements and counting them each time a leaf node might split. In most common cases a node shouldn't store that many elements. However, I work in visual FX and the pathological cases aren't necessarily rare. You can encounter artists creating content with a boatload of coincident points, massive polygons that span the entire scene, etc, so I end up storing a count.

Where are the AABBs?

So one of the first things people might be wondering is where the bounding boxes (rectangles) are for the nodes. I don't store them. I compute them on the fly. I'm kinda surprised most people don't do that in the code I've seen. For me, they're only stored with the tree structure (basically just one AABB for the root).

That might seem like it'd be more expensive to be computing these on the fly, but reducing the memory use of a node can proportionally reduce cache misses when you traverse the tree, and those cache miss reductions tend to be more significant than having to do a couple of bitshifts and some additions/subtractions during traversal. Traversal looks like this:

static QuadNodeList find_leaves(const Quadtree& tree, const QuadNodeData& root, const int rect[4])
{
    QuadNodeList leaves, to_process;
    to_process.push_back(root);
    while (to_process.size() > 0)
    {
        const QuadNodeData nd = to_process.pop_back();

        // If this node is a leaf, insert it to the list.
        if (tree.nodes[nd.index].count != -1)
            leaves.push_back(nd);
        else
        {
            // Otherwise push the children that intersect the rectangle.
            const int mx = nd.crect[0], my = nd.crect[1];
            const int hx = nd.crect[2] >> 1, hy = nd.crect[3] >> 1;
            const int fc = tree.nodes[nd.index].first_child;
            const int l = mx-hx, t = my-hy, r = mx+hx, b = my+hy;

            if (rect[1] <= my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,t, hx, hy, fc+0, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,t, hx, hy, fc+1, nd.depth+1));
            }
            if (rect[3] > my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,b, hx, hy, fc+2, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,b, hx, hy, fc+3, nd.depth+1));
            }
        }
    }
    return leaves;
}
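
The helpers used above aren't shown in the answer. Here's a minimal sketch of what QuadNodeData, QuadNodeList and child_data might look like, inferred from how find_leaves uses them (the field layout, with crect holding a center and half-size, is an assumption):

// Assumed traversal helper: carries a node's index, its depth, and its
// rectangle computed on the fly as {center x, center y, half width, half height}.
struct QuadNodeData
{
    int crect[4];
    int index;
    int depth;
};

// Assumed to be a stack-friendly list (see SmallList<T> below).
typedef SmallList<QuadNodeData> QuadNodeList;

// Packs the computed rectangle, node index and depth of a child node.
static QuadNodeData child_data(int cx, int cy, int hx, int hy, int index, int depth)
{
    QuadNodeData nd;
    nd.crect[0] = cx;
    nd.crect[1] = cy;
    nd.crect[2] = hx;
    nd.crect[3] = hy;
    nd.index = index;
    nd.depth = depth;
    return nd;
}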

Omitting the AABBs is one of the most unusual things I do (I keep looking for other people doing it just to find a peer and fail), but I've measured the before and after and it did reduce times considerably, at least on very large inputs, to compact the quadtree node substantially and just compute AABBs on the fly during traversal. Space and time aren't always diametrically opposed. Sometimes reducing space also means reducing time given how much performance is dominated by the memory hierarchy these days. I've even sped up some real world operations applied on massive inputs by compressing the data to quarter the memory use and decompressing on the fly.

I don't know why many people choose to cache the AABBs: whether it's programming convenience or if it's genuinely faster in their cases. Yet for data structures which split evenly down the center like regular quadtrees and octrees, I'd suggest measuring the impact of omitting the AABBs and computing them on the fly. You might be quite surprised. Of course it makes sense to store AABBs for structures that don't split evenly like Kd-trees and BVHs as well as loose quadtrees.

Floating-Point

I don't use floating-point for spatial indexes and maybe that's why I see improved performance just computing AABBs on the fly with right shifts for division by 2 and so forth. That said, at least SPFP seems really fast nowadays. I don't know since I haven't measured the difference. I just use integers by preference even though I'm generally working with floating-point inputs (mesh vertices, particles, etc). I just end up converting them to integer coordinates for the purpose of partitioning and performing spatial queries. I'm not sure if there's any major speed benefit of doing this anymore. It's just a habit and preference since I find it easier to reason about integers without having to think about denormalized FP and all that.
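
To make that conversion concrete, here's a small sketch of mapping floating-point positions into the integer coordinate space used for partitioning (the world_left/world_top/cells_per_unit parameters are assumptions, not something from the answer):

#include <cmath>

// Hypothetical helper: maps a floating-point position into the integer
// coordinates used by the quadtree. The world origin and scale are assumed
// parameters chosen by the caller.
static void to_tree_coords(float x, float y,
                           float world_left, float world_top, float cells_per_unit,
                           int& out_x, int& out_y)
{
    // std::floor keeps the mapping consistent for negative positions.
    out_x = static_cast<int>(std::floor((x - world_left) * cells_per_unit));
    out_y = static_cast<int>(std::floor((y - world_top) * cells_per_unit));
}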

Centered AABBs

While I only store a bounding box for the root, it helps to use a representation that stores a center and half size for nodes while using a left/top/right/bottom representation for queries to minimize the amount of arithmetic involved.
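
For illustration, here's a sketch of those two representations and an overlap test between them (the QuadCRect field names are my assumptions; the type name itself appears in the tree struct further down):

// Node rectangle: center and half size (what the tree stores for the root
// and computes on the fly for children).
struct QuadCRect
{
    int mid_x, mid_y;
    int half_w, half_h;
};

// Query rectangles use left/top/right/bottom. Returns true if the query
// rect overlaps the node rect.
static bool crect_intersects(const QuadCRect& node, const int rect[4])
{
    return rect[0] <= node.mid_x + node.half_w && rect[2] >= node.mid_x - node.half_w &&
           rect[1] <= node.mid_y + node.half_h && rect[3] >= node.mid_y - node.half_h;
}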

Contiguous Children

This is likewise key, and if we refer back to the node rep:

struct QuadNode
{
    int32_t first_child;
    ...
};

We don't need to store an array of children because all 4 children are contiguous:

first_child+0 = index to 1st child (TL)
first_child+1 = index to 2nd child (TR)
first_child+2 = index to 3rd child (BL)
first_child+3 = index to 4th child (BR)

This not only significantly reduces cache misses on traversal but also allows us to significantly shrink our nodes which further reduces cache misses, storing only one 32-bit index (4 bytes) instead of an array of 4 (16 bytes).

This does mean that if you need to transfer elements to just a couple of quadrants of a parent when it splits, it must still allocate all 4 child leaves to store elements in just two quadrants while having two empty leaves as children. However, the trade-off is more than worth it performance-wise at least in my use cases, and remember that a node only takes 8 bytes given how much we've compacted it.

When deallocating children, we deallocate all four at a time. I do this in constant-time using an indexed free list, like so:

Except we're pooling out memory chunks containing 4 contiguous elements instead of one at a time. This makes it so we usually don't need to involve any heap allocations or deallocations during the simulation. A group of 4 nodes gets marked as freed indivisibly only to then be reclaimed indivisibly in a subsequent split of another leaf node.
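
A sketch of that block allocation, using the free_node index shown in the tree representation further down (the function names here are mine, not from the answer):

// Allocates 4 contiguous child nodes and returns the index of the first one.
// Reclaims a previously freed block if one is available; otherwise appends
// 4 nodes to the back of the array.
int Quadtree::alloc_children()
{
    int index;
    if (free_node != -1)
    {
        // The first node of a freed block stores the index of the next free block.
        index = free_node;
        free_node = nodes[index].first_child;
    }
    else
    {
        index = static_cast<int>(nodes.size());
        nodes.resize(nodes.size() + 4);
    }
    // Initialize all 4 children as empty leaves.
    for (int j = 0; j < 4; ++j)
    {
        nodes[index + j].first_child = -1;
        nodes[index + j].count = 0;
    }
    return index;
}

// Frees a block of 4 contiguous children by pushing it to the free list,
// exactly as the cleanup function below does.
void Quadtree::free_children(int first_child)
{
    nodes[first_child].first_child = free_node;
    free_node = first_child;
}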

Deferred Cleanup

I don't update the quadtree's structure right away upon removing elements. When I remove an element, I just descend down the tree to the child node(s) it occupies and then remove the element, but I don't bother to do anything more just yet even if the leaves become empty.

Instead I do a deferred cleanup like this:

void Quadtree::cleanup()
{
    // Only process the root if it's not a leaf.
    SmallList<int> to_process;
    if (nodes[0].count == -1)
        to_process.push_back(0);

    while (to_process.size() > 0)
    {
        const int node_index = to_process.pop_back();
        QuadNode& node = nodes[node_index];

        // Loop through the children.
        int num_empty_leaves = 0;
        for (int j=0; j < 4; ++j)
        {
            const int child_index = node.first_child + j;
            const QuadNode& child = nodes[child_index];

            // Increment empty leaf count if the child is an empty 
            // leaf. Otherwise if the child is a branch, add it to
            // the stack to be processed in the next iteration.
            if (child.count == 0)
                ++num_empty_leaves;
            else if (child.count == -1)
                to_process.push_back(child_index);
        }

        // If all the children were empty leaves, remove them and 
        // make this node the new empty leaf.
        if (num_empty_leaves == 4)
        {
            // Push all 4 children to the free list.
            nodes[node.first_child].first_child = free_node;
            free_node = node.first_child;

            // Make this node the new empty leaf.
            node.first_child = -1;
            node.count = 0;
        }
    }
}

This is called at the end of every single frame after moving all the agents. The reason I do this kind of deferred removal of empty leaf nodes in multiple iterations and not all at once in the process of removing a single element is that element A might move to node N2, making N1 empty. However, element B might, in the same frame, move to N1, making it occupied again.

With the deferred cleanup, we can handle such cases without unnecessarily removing children only to add them right back when another element moves into that quadrant.

Moving elements in my case is straightforward: 1) remove the element, 2) move it, 3) reinsert it into the quadtree. After we move all the elements and at the end of a frame (not time step; there could be multiple time steps per frame), the cleanup function above is called to remove the children from a parent which has 4 empty leaves as children, which effectively turns that parent into the new empty leaf which might then be cleaned up in the next frame with a subsequent cleanup call (or not, if things get inserted into it or if the empty leaf's siblings are non-empty).
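
In code, moving a single element might look like the following sketch (the signatures are assumptions, and remove here is assumed to only unlink the element's nodes from the leaves it occupies while keeping the QuadElt entry alive):

// Moves an element: remove it from the leaves it occupies, update its
// rectangle, then reinsert it from the root. Empty leaves left behind are
// handled later by the deferred cleanup() call at the end of the frame.
void Quadtree::move(int element, int x1, int y1, int x2, int y2)
{
    remove(element);
    QuadElt& elt = elts[element];
    elt.x1 = x1;
    elt.y1 = y1;
    elt.x2 = x2;
    elt.y2 = y2;
    insert(element);
}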

Let's look at the deferred cleanup visually:

Starting with this, let's say we remove some elements from the tree leaving us with 4 empty leaves:

At this point, if we call cleanup, it will remove 4 leaves if it finds 4 empty child leaves and turn the parent into an empty leaf, like so:

Let's say we remove some more elements:

... and then call cleanup again:

If we call it yet again, we end up with this:

... at which point the root itself turns into an empty leaf. However, the cleanup method never removes the root (it only removes children). Again, the main point of doing it deferred this way and in multiple steps is to reduce the amount of potential redundant work that could occur per time step (which can be a lot) if we did this all immediately every single time an element is removed from the tree. It also helps to distribute that work across frames to avoid stutters.

TBH, I originally applied this "deferred cleanup" technique in a DOS game I wrote in C out of sheer laziness! I didn't want to bother with descending down the tree, removing elements, and then removing nodes in a bottom-up fashion back then because I originally wrote the tree to favor top-down traversal (not top-down and back up again) and really thought this lazy solution was a productivity compromise (sacrificing optimal performance to get implemented faster). However, many years later, I actually got around to implementing quadtree removal in ways that immediately started removing nodes and, to my surprise, I made it significantly slower with more unpredictable and stuttery frame rates. The deferred cleanup, in spite of originally being inspired by my programmer laziness, was actually (and accidentally) a very effective optimization for dynamic scenes.

Singly-Linked Index Lists for Elements

For elements, I use this representation:

// Represents an element in the quadtree.
struct QuadElt
{
    // Stores the ID for the element (can be used to
    // refer to external data).
    int id;

    // Stores the rectangle for the element.
    int x1, y1, x2, y2;
};

// Represents an element node in the quadtree.
struct QuadEltNode
{
    // Points to the next element in the leaf node. A value of -1 
    // indicates the end of the list.
    int next;

    // Stores the element index.
    int element;
};

I use an "element node" which is separate from "element". An element is only inserted once to the quadtree no matter how many cells it occupies. However, for each cell it occupies, an "element node" is inserted which indexes that element.

The element node is a singly-linked index list node into an array, and also using the free list method above. This incurs some more cache misses over storing all the elements contiguously for a leaf. However, given that this quadtree is for very dynamic data which is moving and colliding every single time step, it generally takes more processing time than it saves to seek out a perfectly contiguous representation for the leaf elements (you would effectively have to implement a variable-sized memory allocator which is really fast, and that's far from an easy thing to do). So I use the singly-linked index list which allows a free list constant-time approach to allocation/deallocation. When you use that representation, you can transfer elements from split parents to new leaves by just changing a few integers.
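
For example, pushing an element into a leaf with this representation is just a constant-time push onto the front of the leaf's list (a sketch; the helper name is mine):

// Inserts an element (by its index in 'elts') into a leaf by pushing a new
// element node onto the front of the leaf's singly-linked index list.
void Quadtree::leaf_insert_element(int leaf_index, int element)
{
    QuadEltNode elt_node;
    elt_node.element = element;

    // The new element node points to the leaf's previous head (-1 if the
    // leaf was empty), and then becomes the new head.
    elt_node.next = nodes[leaf_index].first_child;
    nodes[leaf_index].first_child = elt_nodes.insert(elt_node);
    ++nodes[leaf_index].count;
}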

SmallList<T>

Oh, I should mention this. Naturally it helps if you don't heap allocate just to store a temporary stack of nodes for non-recursive traversal. SmallList<T> is similar to vector<T> except it won't involve a heap allocation until you insert more than 128 elements to it. It's similar to SBO string optimizations in the C++ standard lib. It's something I implemented and have been using for ages and it does help a lot to make sure you use the stack whenever possible.
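
The answer doesn't include SmallList<T> itself, but a bare-bones sketch of the idea might look like this (simplified: T is assumed trivially copyable, and this is not the author's actual class):

#include <cstdlib>
#include <cstring>

// Bare-bones small-buffer list: keeps up to 128 elements in a fixed internal
// buffer and only heap-allocates once it grows beyond that.
template <class T>
class SmallList
{
public:
    SmallList(): ld(buf), num(0), cap(128) {}
    SmallList(const SmallList& other): ld(buf), num(0), cap(128) { *this = other; }
    ~SmallList() { if (ld != buf) std::free(ld); }

    SmallList& operator=(const SmallList& other)
    {
        if (this != &other)
        {
            num = 0;
            for (int j = 0; j < other.num; ++j)
                push_back(other.ld[j]);
        }
        return *this;
    }

    void push_back(const T& element)
    {
        if (num >= cap)
        {
            // Out of room: double the capacity and move to (or within) the heap.
            cap *= 2;
            T* new_ld = static_cast<T*>(std::malloc(cap * sizeof(T)));
            std::memcpy(new_ld, ld, num * sizeof(T));
            if (ld != buf)
                std::free(ld);
            ld = new_ld;
        }
        ld[num++] = element;
    }

    // Removes and returns the last element.
    T pop_back() { return ld[--num]; }

    int size() const { return num; }
    T& operator[](int n) { return ld[n]; }
    const T& operator[](int n) const { return ld[n]; }

private:
    T buf[128];
    T* ld;
    int num, cap;
};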

Tree Representation

Here's the representation of the quadtree itself:

struct Quadtree
{
    // Stores all the elements in the quadtree.
    FreeList<QuadElt> elts;

    // Stores all the element nodes in the quadtree.
    FreeList<QuadEltNode> elt_nodes;

    // Stores all the nodes in the quadtree. The first node in this
    // sequence is always the root.
    std::vector<QuadNode> nodes;

    // Stores the quadtree extents.
    QuadCRect root_rect;

    // Stores the first free node in the quadtree to be reclaimed as 4
    // contiguous nodes at once. A value of -1 indicates that the free
    // list is empty, at which point we simply insert 4 nodes to the
    // back of the nodes array.
    int free_node;

    // Stores the maximum depth allowed for the quadtree.
    int max_depth;
};

As pointed out above, we store a single rectangle for the root (root_rect). All sub-rects are computed on the fly. All nodes are stored contiguously in an array (std::vector<QuadNode>) along with the elements and element nodes (in FreeList<T>).

FreeList<T>

I use a FreeList data structure which is basically an array (and random-access sequence) that lets you remove elements from anywhere in constant-time (leaving holes behind which get reclaimed upon subsequent insertions in constant-time). Here's a simplified version which doesn't bother with handling non-trivial data types (doesn't use placement new or manual destruction calls):

/// Provides an indexed free list with constant-time removals from anywhere
/// in the list without invalidating indices. T must be trivially constructible 
/// and destructible.
template <class T>
class FreeList
{
public:
    /// Creates a new free list.
    FreeList();

    /// Inserts an element to the free list and returns an index to it.
    int insert(const T& element);

    // Removes the nth element from the free list.
    void erase(int n);

    // Removes all elements from the free list.
    void clear();

    // Returns the range of valid indices.
    int range() const;

    // Returns the nth element.
    T& operator[](int n);

    // Returns the nth element.
    const T& operator[](int n) const;

private:
    union FreeElement
    {
        T element;
        int next;
    };
    std::vector<FreeElement> data;
    int first_free;
};

template <class T>
FreeList<T>::FreeList(): first_free(-1)
{
}

template <class T>
int FreeList<T>::insert(const T& element)
{
    if (first_free != -1)
    {
        const int index = first_free;
        first_free = data[first_free].next;
        data[index].element = element;
        return index;
    }
    else
    {
        FreeElement fe;
        fe.element = element;
        data.push_back(fe);
        return static_cast<int>(data.size() - 1);
    }
}

template <class T>
void FreeList<T>::erase(int n)
{
    data[n].next = first_free;
    first_free = n;
}

template <class T>
void FreeList<T>::clear()
{
    data.clear();
    first_free = -1;
}

template <class T>
int FreeList<T>::range() const
{
    return static_cast<int>(data.size());
}

template <class T>
T& FreeList<T>::operator[](int n)
{
    return data[n].element;
}

template <class T>
const T& FreeList<T>::operator[](int n) const
{
    return data[n].element;
}

I have one which does work with non-trivial types and provides iterators and so forth but is much more involved. These days I tend to work more with trivially constructible/destructible C-style structs anyway (only using non-trivial user-defined types for high-level stuff).
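
As a quick usage illustration (my own example, not from the answer), here's how holes left by erase get reclaimed by subsequent insert calls:

// Example usage of FreeList with the element nodes from earlier.
static void free_list_example()
{
    FreeList<QuadEltNode> elt_nodes;

    QuadEltNode en;
    en.next = -1;

    en.element = 0;
    const int first = elt_nodes.insert(en);   // appends at index 0

    en.element = 1;
    elt_nodes.insert(en);                     // appends at index 1

    elt_nodes.erase(first);                   // leaves a hole at index 0

    en.element = 2;
    const int third = elt_nodes.insert(en);   // reclaims the hole: third == first
    (void)third;
}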

Maximum Tree Depth

I prevent the tree from subdividing too much by specifying a max depth allowed. For the quick simulation I whipped up, I used 8. For me this is crucial since, again, in VFX I encounter pathological cases a lot, including content created by artists with lots of coincident or overlapping elements which, without a maximum tree depth limit, would cause the tree to try to subdivide indefinitely.

There is a bit of fine-tuning if you want optimal performance with respect to max depth allowed and how many elements you allow to be stored in a leaf before it splits into 4 children. I tend to find the optimal results are gained with something around 8 elements max per node before it splits, and a max depth set so that the smallest cell size is a little over the size of your average agent (otherwise more single agents could be inserted into multiple leaves).
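
Expressed in code, that split decision during insertion might look something like this sketch (the helper name and the 8-element threshold constant are mine, matching the tuning described above):

// Returns true if a leaf should split into 4 children after an insertion.
// A leaf only splits if it holds more than the per-leaf element threshold
// and it hasn't reached the maximum allowed depth.
static bool leaf_should_split(const Quadtree& tree, const QuadNodeData& leaf)
{
    const int max_elements_per_leaf = 8;
    return tree.nodes[leaf.index].count > max_elements_per_leaf &&
           leaf.depth < tree.max_depth;
}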

Collision and Queries

There are a couple of ways to do the collision detection and queries. I often see people do it like this:

for each element in scene:
     use quad tree to check for collision against other elements

This is very straightforward, but the problem with this approach is that the first element in the scene might be in a totally different location in the world from the second. As a result, the paths we take down the quadtree could be totally sporadic. We might traverse one path down to a leaf for one element and then not need that same path again until, say, the 50,000th element. By this time the nodes involved in that path may have already been evicted from the CPU cache. So I recommend doing it this way:

traversed = {}
gather quadtree leaves
for each leaf in leaves:
{
     for each element in leaf:
     {
          if not traversed[element]:
          {
              use quad tree to check for collision against other elements
              traversed[element] = true                  
          }
     }
}

While that's quite a bit more code and requires we keep a traversed bitset or parallel array of some sort to avoid processing elements twice (since they may be inserted in more than one leaf), it helps make sure that we descend the same paths down the quadtree throughout the loop. That helps keep things much more cache-friendly. Also if after attempting to move the element in the time step, it's still encompassed entirely in that leaf node, we don't even need to work our way back up again from the root (we can just check that one leaf only).
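
Here's roughly what that looks like in C++ as a sketch, reusing find_leaves from earlier and a traversed flag per element (the QuadCRect field names follow the earlier sketch, and the actual per-pair collision test is left as a comment since it's application-specific):

// Processes collisions leaf by leaf so the paths taken down the tree stay
// cache-friendly and each element is only processed once.
void process_collisions(Quadtree& tree, const QuadNodeData& root_data)
{
    std::vector<bool> traversed(tree.elts.range(), false);

    // Query rectangle covering the whole tree (left, top, right, bottom).
    const int root_rect[4] = {
        tree.root_rect.mid_x - tree.root_rect.half_w, tree.root_rect.mid_y - tree.root_rect.half_h,
        tree.root_rect.mid_x + tree.root_rect.half_w, tree.root_rect.mid_y + tree.root_rect.half_h
    };

    QuadNodeList leaves = find_leaves(tree, root_data, root_rect);
    for (int j = 0; j < leaves.size(); ++j)
    {
        const int leaf_index = leaves[j].index;

        // Walk this leaf's singly-linked list of element nodes.
        for (int en = tree.nodes[leaf_index].first_child; en != -1; en = tree.elt_nodes[en].next)
        {
            const int element = tree.elt_nodes[en].element;
            if (!traversed[element])
            {
                traversed[element] = true;
                // Use the quadtree to gather nearby elements and test them
                // against this one (query + narrow-phase test omitted here).
            }
        }
    }
}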

Common Inefficiencies: Things to Avoid

While there are many ways to skin the cat and achieve an efficient solution, there is a common way to achieve a very inefficient solution. And I've encountered my share of very inefficient quadtrees, kd trees, and octrees in my career working in VFX. We're talking over a gigabyte of memory use just to partition a mesh with 100k triangles while taking 30 secs to build, when a decent implementation should be able to do the same hundreds of times a second and just take a few megs. There are many people whipping these up to solve problems who are theoretical wizards but didn't pay much attention to memory efficiency.

So the absolute most common no-no I see is to store one or more full-blown containers with each tree node. By full-blown container, I mean something that owns and allocates and frees its own memory, like this:

struct Node
{
     ...

     // Stores the elements in the node.
     List<Element> elements;
};

And List<Element> might be a list in Python, an ArrayList in Java or C#, std::vector in C++, etc: some data structure that manages its own memory/resources.

The problem here is that while such containers are very efficiently implemented for storing a large number of elements, all of them in all languages are extremely inefficient if you instantiate a boatload of them only to store a few elements in each one. One of the reasons is that the container metadata tends to be quite explosive in memory usage at such a granular level of a single tree node. A container might need to store a size, capacity, a pointer/reference to data it allocates, etc, and all for a generalized purpose so it might use 64-bit integers for size and capacity. As a result the metadata just for an empty container might be 24 bytes which is already 3 times larger than the entirety of the node representation I proposed, and that's just for an empty container designed to store elements in leaves.

Furthermore each container often wants to either heap/GC-allocate on insertion or require even more preallocated memory in advance (at which point it might take 64 bytes just for the container itself). So that either becomes slow because of all the allocations (note that GC allocations are really fast initially in some implementations like JVM, but that's only for the initial burst Eden cycle) or because we're incurring a boatload of cache misses because the nodes are so huge that barely any fit into the lower levels of the CPU cache on traversal, or both.

Yet this is a very natural inclination and makes intuitive sense since we talk about these structures theoretically using language like, "Elements are stored in the leaves" which suggests storing a container of elements in leaf nodes. Unfortunately it has an explosive cost in terms of memory use and processing. So avoid this if the desire is to create something reasonably efficient. Make the Node share and point to (refer to) or index memory allocated and stored for the entire tree, not for every single individual node. In actuality the elements shouldn't be stored in the leaves.

Elements should be stored in the tree and leaf nodes should index or point to those elements.

Conclusion

Phew, so those are the main things I do to achieve what is hopefully considered a decent-performing solution. I hope that helps. Note that I am aiming this at a somewhat advanced level for people who have already implemented quadtrees at least once or twice. If you have any questions, feel free to shoot.

Since this question is a bit broad, I might come and edit it and keep tweaking and expanding it over time if it doesn't get closed (I love these types of questions since they give us an excuse to write about our experiences working in the field, but the site doesn't always like them). I'm also hoping some experts might jump in with alternative solutions I can learn from and perhaps use to improve mine further.

Again quadtrees aren't actually my favorite data structure for extremely dynamic collision scenarios like this. So I probably have a thing or two to learn from people who do favor quadtrees for this purpose and have been tweaking and tuning them for years. Mostly I use quadtrees for static data that doesn't move around every frame, and for those I use a very different representation from the one proposed above.
