查找所有重叠区间的计数 [英] Find Count of all Overlapping Intervals

查看:31
本文介绍了查找所有重叠区间的计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对所有这样的记录都有 startTime 和 endTime:

I have startTime and endTime for all records like this:

{
 startTime : 21345678
 endTime   : 31345678
}

我试图找出所有冲突的数量.例如,如果有两条记录并且它们重叠,则冲突数为 1.如果有 3 条记录并且其中两条重叠,则冲突为 1.如果有 3 条记录且所有 3 条重叠,则冲突为 3,即 [(X1, X2), (X1, X3), (X2, X3)]

I am trying to find number of all the conflicts. For example if there are two records and they overlap the number of conflict is 1. If there are three records and two of them overlap the conflict is 1. If there are three records and all three overlap the conflicts is 3 i.e [(X1, X2), (X1, X3), (X2, X3)]

作为一种算法,我正在考虑按开始时间对数据进行排序,并为每个排序的记录检查结束时间并找到开始时间小于结束时间的记录.这将是 O(n2) 时间.更好的方法是使用区间树并将每条记录插入树中,并在发生重叠时查找计数.这将是 O(nlgn) 时间.

As an algorithm I am thinking of sorting the data by start time and for each sorted record checking the end time and finding the records with start time less than the end time. This will be O(n2) time. A better approach will be using interval tree and inserting each record into the tree and finding the counts when overlaps occur. This will be O(nlgn) time.

我没有经常使用 mongoDB,所以我可以使用什么样的查询来实现这样的事情?

I have not used mongoDB much so what kind of query can I use to achieve something like this?

推荐答案

正如你正确提到的,有不同的方法,它们的执行固有的复杂性不同.这基本上涵盖了它们是如何完成的,以及您实际实施哪一个取决于您的数据和用例最适合哪个.

As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done and which one you implement actually depends on which your data and use case is best suited to.

使用 $lookup 运算符允许将 pipeline 作为自连接"的表达式给出到同一个集合.这基本上可以再次查询集合中 starttime 或"的任何项目.当前文档的 endtime 介于任何其他文档的相同值之间,当然不包括原始文档:

The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:

db.getCollection('collection').aggregate([
  { "$lookup": {
    "from": "collection",
    "let": {
      "_id": "$_id",
      "starttime": "$starttime",
      "endtime": "$endtime"
    },
    "pipeline": [
      { "$match": {
        "$expr": {
          "$and": [
            { "$ne": [ "$$_id", "$_id" },
            { "$or": [
              { "$and": [
                { "$gte": [ "$$starttime", "$starttime" ] },
                { "$lte": [ "$$starttime", "$endtime" ] }
              ]},
              { "$and": [
                { "$gte": [ "$$endtime", "$starttime" ] },
                { "$lte": [ "$$endtime", "$endtime" ] }
              ]}
            ]},
          ]
        },
        "as": "overlaps"
      }},
      { "$count": "count" },
    ]
  }},
  { "$match": { "overlaps.0": { "$exists": true }  } }
])

$lookup 执行加入";在同一个集合上,允许您保留当前文档";_id"starttime"endtime" 值分别通过 let" 管道阶段的选项.这些将作为局部变量"提供.在表达式的后续 管道" 中使用 $$ 前缀.

The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.

在这个子管道"内您使用 $match 管道阶段和 $expr 查询运算符,它允许您评估聚合框架逻辑表达式作为查询条件的一部分.这允许在选择符合条件的新文档时比较值.

Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.

条件只是查找已处理的文档";其中 _id" 字段不等于当前文档",$and 其中 "starttime"$or "当前文档"的结束时间"值落在已处理文档"的相同属性之间.在这里注意这些以及各自的 $gte$lte 运算符是 聚合比较运算符"而不是 "query operator" 形式,如$expr 在上下文中必须是 boolean.这就是聚合比较运算符的实际作用,也是传入比较值的唯一方法.

The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime" $or "endtime" values of the "current document" falls between the same properties of the "processed document". Noting here that these as well as the respective $gte and $lte operators are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.

因为我们只想要计数";在匹配项中,$count 管道阶段用于执行此操作.整体$lookup 将是一个单一元素";有计数的数组,或空数组";条件不匹配的地方.

Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.

另一种情况是省略"$count 阶段允许匹配的文档返回.这允许容易识别,但作为嵌入在文档中的阵列"您确实需要注意重叠"的数量;这将作为整个文档返回,并且这不会导致违反 16MB 的 BSON 限制.在大多数情况下,这应该没问题,但是对于您期望给定文档有大量重叠的情况,这可能是一个真实的情况.所以这确实是需要注意的事情.

An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real case. So it's really something more to be aware of.

$lookup在这种情况下,流水线阶段将总是"在结果中返回一个数组,即使是空的.输出属性的名称merging"将 overlaps" 添加到现有文档中,如 属性中指定的那样manual/reference/operator/aggregation/lookup/" rel="nofollow noreferrer">$lookup 阶段.

The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.

遵循 $lookup,然后我们可以做一个简单的 $match 使用 $exists<的正则查询表达式/code> 测试输出数组的 0 索引值.数组中实际上存在一些内容并因此重叠"的地方?条件为真并返回文档,显示计数或文档重叠";根据您的选择.

Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.

您的 MongoDB 缺乏这种支持的另一种情况是加入".手动为每个检查的文档发出上述相同的查询条件:

The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:

db.getCollection('collection').find().map( d => {
  var overlaps = db.getCollection('collection').find({
    "_id": { "$ne": d._id },
    "$or": [
      { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
      { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
  }).toArray();

  return ( overlaps.length !== 0 ) 
    ? Object.assign(
        d,
        {
          "overlaps": {
            "count": overlaps.length,
            "documents": overlaps
          }
        }
      )
    : null;
}).filter(e => e != null);

这本质上是相同的逻辑,只是我们实际上需要返回数据库".为了发出查询以匹配重叠的文档.这次是查询运算符".用于查找当前文档值介于已处理文档的值之间的位置.

This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.

因为结果已经从服务器返回,所以在输出中添加内容没有 BSON 限制.您可能有内存限制,但这是另一个问题.简单地说,我们通过 .toArray() 返回数组而不是光标,这样我们就有了匹配的文档,并且可以简单地访问数组长度来获得计数.如果您实际上并不需要这些文件,那么使用 .count() 而不是 .find() 效率更高,因为没有文档获取开销.

Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.

然后将输出简单地与现有文档合并,其中另一个重要区别是,由于这些是多个查询",因此可以将其与现有文档合并.没有办法提供它们必须匹配"的条件.某物.所以这让我们不得不考虑计数(或数组长度)为 0 的结果,此时我们所能做的就是返回一个 null 值,我们可以稍后从结果数组中获取 .filter().其他迭代光标的方法采用相同的丢弃"光标的基本原理.结果在我们不想要的地方.但是没有什么能阻止查询在服务器上运行,并且这种过滤是后处理".以某种形式.

The output is then simply merged with the existing document, where the other important distinction is that since theses are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering there will be results where the count ( or array length ) is 0 and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server and this filtering is "post processing" in some form or the other.

因此,上述方法适用于所描述的结构,但当然,总体复杂性要求对于每个文档,您必须基本上检查集合中的所有其他文档以查找重叠.因此,虽然使用 $lookup 允许对于减少传输和响应开销方面的一些效率",它仍然遇到同样的问题,即您仍然基本上将每个文档与所有内容进行比较.

So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.

一个更好的解决方案你可以让它适合的地方"是代替存储一个硬值"*代表每个文档上的间隔.例如,我们可以假设"有可靠的预订";一天内一小时的时段,总共 24 个预订时段.这可能"表示为:

A better solution "where you can make it fit" is to instead store a "hard value"* representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day for a total of 24 booking periods. This "could" be represented something like:

{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }

使用像这样组织的数据,其中有一个设定的间隔指标,复杂性大大降低,因为它实际上只是分组"的问题.关于 booking" 属性中数组的间隔值:

With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:

db.booking.aggregate([
  { "$unwind": "$booking" },
  { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
  { "$match": { "docs.1": { "$exists": true } } }
])

还有输出:

{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }

正确识别 1011 区间 A"D" 包含重叠,而 B"A"12 上重叠.其他区间和​​文档匹配通过相同的 $exists 测试,但这次在 1 索引(或存在的第二个数组元素)上进行测试,以查看是否存在多个".分组中的文档,因此表明存在重叠.

That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.

这只是使用 $unwind聚合管道阶段解构/非规范化";数组内容,以便我们可以访问内部值进行分组.这正是 $group key"所在的阶段提供的是预订间隔 id 和 $push 运算符用于收集"数据.在该组中找到的有关当前文档的数据.$match 解释如下早一点.

This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.

这甚至可以扩展为替代演示:

This can even be expanded for alternate presentation:

db.booking.aggregate([
  { "$unwind": "$booking" },
  { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
  { "$match": { "docs.1": { "$exists": true } } },
  { "$unwind": "$docs" },
  { "$group": {
    "_id": "$docs",
    "intervals": { "$push": "$_id" }  
  }}
])

有输出:

{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }

这是一个简化的演示,但如果您拥有的数据允许它进行所需的分析,那么这是更有效的方法.因此,如果您可以保持粒度";固定为设置";可以在每个文档上共同记录的间隔,然后分析和报告可以使用后一种方法来快速有效地识别这种重叠.

It's a simplified demonstration, but where the data you have would allow it for the sort of analysis required then this is the far more efficient approach. So if you can keep the "granularity" to be fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.

本质上,这就是您将如何实现您基本上提到的更好"的方法.无论如何都要接近,第一个是轻微的"对你最初理论的改进.看看哪一个真正适合您的情况,但这应该解释实现和差异.

Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, and the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

这篇关于查找所有重叠区间的计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆