在 Perl 中确定范围重叠的最快方法 [英] Quickest way to determine range overlap in Perl

查看:24
本文介绍了在 Perl 中确定范围重叠的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两组范围.每个范围都是一对整数(开始和结束),表示单个较大范围的某个子范围.两组范围的结构与此类似(当然 ...s 将替换为实际数字).

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).

$a_ranges =
{
  a_1 =>
  {
    start => ...,
    end   => ...,
  },
  a_2 =>
  {
    start => ...,
    end   => ...,
  },
  a_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

$b_ranges =
{
  b_1 =>
  {
    start => ...,
    end   => ...,
  },
  b_2 =>
  {
    start => ...,
    end   => ...,
  },
  b_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

我需要确定集合 A 中的哪些范围与集合 B 中的哪些范围重叠.给定两个范围,很容易确定它们是否重叠.我只是使用双循环来执行此操作——在外循环中遍历集合 A 中的所有元素,在内循环中遍历集合 B 中的所有元素,并跟踪哪些元素重叠.

I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.

这种方法有两个问题.首先是重叠空间非常稀疏——即使每个集合中有数千个范围,我希望集合 A 中的每个范围都与集合 B 中的 1 或 2 个范围重叠.我的方法列举了每一种可能性,即矫枉过正.这导致了我的第二个问题——它的扩展性非常差.当每组有数百个范围时,代码会很快完成(不到一分钟),但当每组有数千个范围时,代码需要很长时间(+/- 30 分钟).

I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.

有没有更好的方法可以索引这些范围,这样我就不会做太多不必要的重叠检查?

Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?

更新:我正在寻找的输出是两个散列(每个范围集一个),其中键是范围 ID,值是来自另一组的范围的 ID与此集合中的给定范围重叠.

Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.

推荐答案

这听起来像是 区间树,这是一种专门为支持这种操作而设计的数据结构.如果您有两组大小为 m 和 n 的区间,那么您可以在时间 O(m lg m) 内将其中一组构建成区间树,然后在时间 O(n lg m + k) 内进行 n 次交集查询,其中k 是您找到的交叉点的总数.这给出了 O((m + n) lg m + k) 的净运行时间.请记住,在最坏的情况下 k = O(nm),所以这并不比你所拥有的更好,但是对于交叉点数量稀疏的情况,这可能比你正确的 O(mn) 运行时要好得多现在.

This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.

我没有太多使用区间树的经验(在 Perl 中零经验,抱歉!),但从描述看来,它们不应该那么难构建.如果一个还不存在,我会感到非常惊讶.

I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.

希望这会有所帮助!

这篇关于在 Perl 中确定范围重叠的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆