What is the most efficient way to get the top N items of the union of M sorted sets


Problem description

Say you have 4 sorted sets with thousands and thousands of keys and scores. Since they are sorted sets, getting the top items can be done in logarithmic time complexity.

The easy way would be to take the union of the sets, and then get the top items. But doing so is at least linear to the sum of all items in all sets.

The best way I could think of is this:

  1. Take the top N items from every set
  2. Find the item with the lowest rank and the highest score for that rank.
  3. Divide that score by the number of sets. (Any key with a score lower than this can never be in the top N)
  4. Take the union of those keys. (Ignoring scores)
  5. Find the scores for all keys in all sets. (A key might have score 1 in one set and 10000 in another)

That amounts to finding all keys that could possibly be in the top list and taking the union of just those keys. There are probably more efficient ways to limit the number of items to consider.

[edit] Keys occur in one or more sets, and their summed scores determine the final score. So a key that is in all sets with a low score might have a higher final score than a key with a high score that is in only one set.
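For reference, the "easy way" described above (take the full union, then pick the top items) can be sketched as follows. This is only a baseline under assumptions not stated in the question: each sorted set is modelled as a plain dict mapping key to score, and the function name `top_n_by_full_union` is hypothetical. It touches every item in every set, which is exactly the linear cost the question wants to avoid.

```python
import heapq
from collections import defaultdict


def top_n_by_full_union(sets, n):
    """Sum each key's scores over all sets, then return the n best (key, total) pairs.

    `sets` is assumed to be a list of dicts mapping key -> score.
    """
    totals = defaultdict(int)
    for scored in sets:
        for key, score in scored.items():
            totals[key] += score  # a key present in several sets accumulates its scores
    # nlargest returns the pairs ordered from highest to lowest total
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```

With `[{"a": 10, "b": 1}, {"b": 10000, "c": 5}, {"a": 3, "c": 4}]` and `n = 2`, key `b` wins because of its single large score, and `a` comes second with 10 + 3.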

Solution

The algorithm you propose seems quite awkward. Just take one of the following:

The simple way

for i = 1 to n
    loop through all sets and look at the smallest element of each,
    pick the smallest of those elements and remove it from its set

Complexity: O(n * s) where n is the number of items you want and s is the number of sets.

Of course, if you are not allowed to remove elements from the sets, you can also maintain iterators into each set to get elements from them in sorted order without having to alter the sets.
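A minimal Python sketch of the simple way, using one cursor per set instead of removing elements. It assumes each set is available as a list sorted in ascending order; the function name `top_n_simple` is made up for illustration. Like the pseudocode, it merges by element value only and does not sum the scores of a key that appears in several sets.

```python
def top_n_simple(sorted_lists, n):
    """Return the n smallest elements across ascending lists in O(n * s) time."""
    positions = [0] * len(sorted_lists)  # one cursor per list, so the lists stay unmodified
    result = []
    for _ in range(n):
        best = None  # index of the list whose current element is smallest so far
        for i, lst in enumerate(sorted_lists):
            if positions[i] >= len(lst):
                continue  # this list is exhausted
            if best is None or lst[positions[i]] < sorted_lists[best][positions[best]]:
                best = i
        if best is None:  # every list is exhausted before n elements were found
            break
        result.append(sorted_lists[best][positions[best]])
        positions[best] += 1  # advance the cursor instead of deleting the element
    return result
```

For the top (largest) items, the same loop works on lists sorted in descending order with the comparison flipped to `>`.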

A more efficient way

Maintain a priority queue over all the smallest elements of each set. Whenever removing the smallest element e from that priority queue, reinsert the next element from the set from which e came.

Complexity: Assume a simple priority queue (a binary heap, for example) with O(log k) 'insert' and O(log k) 'remove smallest element' complexity for a queue holding k elements. There are better ones like Fibonacci heaps, but this one will do just fine. Then we have:

  • s insertions to fill the priority queue at the start, so O(s log s).
  • n times "delete smallest element" followed by inserting a new one, so O(n log s) (since there are never more than s elements in the queue)

Thus, we achieve O(s log s + n log s), which beats the O(n * s) of the simple way once s is large.
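The priority-queue approach can be sketched with Python's standard `heapq` module. As before, this assumes ascending lists as input and a hypothetical function name, `top_n_heap`; it merges by element value and does not sum scores across sets.

```python
import heapq


def top_n_heap(sorted_lists, n):
    """Return the n smallest elements across ascending lists in O(s + n log s) time."""
    # Seed the heap with the first element of every non-empty list.
    # Each entry is (value, list index, position), so ties on value fall back to the list index.
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < n:
        value, i, pos = heapq.heappop(heap)  # smallest element overall
        result.append(value)
        if pos + 1 < len(sorted_lists[i]):
            # Reinsert the next element from the list the popped element came from.
            heapq.heappush(heap, (sorted_lists[i][pos + 1], i, pos + 1))
    return result
```

One small difference from the analysis above: `heapq.heapify` builds the initial queue in O(s) rather than with s separate insertions, which does not change the overall picture.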

Comparison

As long as s is quite small, there shouldn't really be a big difference between the algorithms and you can also pick the simple one. If you have a lot of sets, then you should definitely go for the second approach.

Lookup Complexity

In my analysis, I omitted the logarithmic lookup factor to find the smallest element of each set and assumed that it could be retrieved in O(1), as in a sorted list. Varying the lookup cost from O(1) to O(log n) just introduces an additional factor; it does not alter the algorithms. In addition, you usually only pay the O(log n) once, at the first lookup. Afterwards, you usually have an iterator to the smallest element, and accessing each further element through the iterator is then only O(1).
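The iterator point can be illustrated with `heapq.merge` from the Python standard library, which takes already-sorted iterables and consumes them lazily. The wrapper name `top_n_lazy` is hypothetical, and the sketch assumes each set can hand out an iterator in sorted order.

```python
import heapq
from itertools import islice


def top_n_lazy(sorted_iterables, n, descending=False):
    """Return the first n elements of the merged order, pulling items only as needed.

    With descending=True every input must already be sorted from largest to smallest.
    """
    merged = heapq.merge(*sorted_iterables, reverse=descending)
    return list(islice(merged, n))  # stops advancing the iterators after n elements
```

Because `islice` stops after n elements, each input iterator is advanced only as far as the merge needs, so the positioning cost is paid once per set and every later step is a single iterator advance.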
