Find Count of all Overlapping Intervals


Question

I have startTime and endTime for all records, like this:

{
 startTime : 21345678
 endTime   : 31345678
}

I am trying to find the number of all conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].

As an algorithm, I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records whose start time is less than that end time. This would be O(n^2) time. A better approach would be to use an interval tree, inserting each record into the tree and counting the overlaps as they occur. This would be O(n log n) time.
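For reference, the sorted-sweep idea can be sketched outside of MongoDB in plain JavaScript (a hypothetical `countConflicts` helper, not part of the original question): sort the start/end events, and every interval that begins while k others are still open adds k conflicts, giving the pair count in O(n log n).

```javascript
// Count overlapping pairs with an event sweep: sort all start/end
// events, then each interval that opens while "active" intervals are
// still open contributes that many new conflicts (pairs).
function countConflicts(intervals) {
  const events = [];
  for (const { startTime, endTime } of intervals) {
    events.push([startTime, +1]);  // interval opens
    events.push([endTime, -1]);    // interval closes
  }
  // Sort by time; on ties, process opens first so that touching
  // endpoints count as an overlap (matching inclusive $gte/$lte).
  events.sort((a, b) => a[0] - b[0] || b[1] - a[1]);
  let active = 0;
  let conflicts = 0;
  for (const [, delta] of events) {
    if (delta === +1) conflicts += active;  // pairs with open intervals
    active += delta;
  }
  return conflicts;
}
```

With three mutually overlapping intervals this returns 3, matching the [(X1, X2), (X1, X3), (X2, X3)] enumeration above.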

I have not used MongoDB much, so what kind of query can I use to achieve something like this?

Answer

As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done; which one you actually implement depends on what your data and use case are best suited to.

The most simple approach can be employed by using the new syntax of the $lookup operator with MongoDB 3.6, which allows a pipeline to be given as an expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:

db.getCollection('collection').aggregate([
  { "$lookup": {
    "from": "collection",
    "let": {
      "_id": "$_id",
      "starttime": "$starttime",
      "endtime": "$endtime"
    },
    "pipeline": [
      { "$match": {
        "$expr": {
          "$and": [
            { "$ne": [ "$$_id", "$_id" ] },
            { "$or": [
              { "$and": [
                { "$gte": [ "$$starttime", "$starttime" ] },
                { "$lte": [ "$$starttime", "$endtime" ] }
              ]},
              { "$and": [
                { "$gte": [ "$$endtime", "$starttime" ] },
                { "$lte": [ "$$endtime", "$endtime" ] }
              ]}
            ]}
          ]
        }
      }},
      { "$count": "count" }
    ],
    "as": "overlaps"
  }},
  { "$match": { "overlaps.0": { "$exists": true } } }
])

The single $lookup performs the "join" on the same collection, allowing you to keep the "current document" values for "_id", "starttime" and "endtime" via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in the subsequent "pipeline" of the expression.

Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.

The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime" $or "endtime" values of the "current document" fall between the same properties of the "processed document". Note here that these, as well as the respective $gte and $lte operators, are the "aggregation comparison operators" and not the "query operator" form, as the result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.

Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.

An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents, and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real concern. So it's really something more to be aware of.

The $lookup pipeline stage in this context will "always" return an array in the result, even if empty. The name of the output property "merging" into the existing document will be "overlaps", as specified in the "as" property to the $lookup stage.

Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of the output array. Where there actually is some content in the array, and therefore "overlaps", the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.

The alternate case, where your MongoDB lacks this support, is to "join" manually by issuing the same query conditions outlined above for each document examined:

db.getCollection('collection').find().map( d => {
  var overlaps = db.getCollection('collection').find({
    "_id": { "$ne": d._id },
    "$or": [
      { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
      { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
  }).toArray();

  return ( overlaps.length !== 0 ) 
    ? Object.assign(
        d,
        {
          "overlaps": {
            "count": overlaps.length,
            "documents": overlaps
          }
        }
      )
    : null;
}).filter(e => e != null);

This is essentially the same logic, except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.

Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put, we return the array rather than the cursor via .toArray(), so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient, since there is no document fetching overhead.

The output is then simply merged with the existing document. The other important distinction is that, since these are "multiple queries", there is no way of providing the condition that they must "match" something. So this leaves us considering that there will be results where the count (or array length) is 0, and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in some form or another.

So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore, whilst using $lookup allows for some "efficiency" in the reduction of transport and response overhead, it still suffers the same problem: you are still essentially comparing each document to everything.

A better solution, where you can make it fit, is to instead store a "hard value" representative of the interval on each document. For instance, we could "presume" that there are solid "booking" periods of one hour within a day, for a total of 24 booking periods. This "could" be represented as something like:

{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }

With data organized like that, where there is a set indicator for the interval, the complexity is greatly reduced, since it really just becomes a matter of "grouping" on the interval value from the array within the "booking" property:

db.booking.aggregate([
  { "$unwind": "$booking" },
  { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
  { "$match": { "docs.1": { "$exists": true } } }
])

Output:

{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }

That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and document matches are excluded via the same $exists test, except this time on the 1 index (i.e. a second array element being present), in order to see that there was "more than one" document in the grouping, hence indicating an overlap.

This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage, where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document found in that group. The $match is as explained earlier.

This can even be expanded for an alternate presentation:

db.booking.aggregate([
  { "$unwind": "$booking" },
  { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
  { "$match": { "docs.1": { "$exists": true } } },
  { "$unwind": "$docs" },
  { "$group": {
    "_id": "$docs",
    "intervals": { "$push": "$_id" }  
  }}
])

With the output:

{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }

It's a simplified demonstration, but where the data you have allows the sort of analysis required, this is the far more efficient approach. So if you can keep the "granularity" fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.

Essentially, this is how you would implement what you basically mentioned as the "better" approach anyway, with the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.
