查找线性时间中出现次数超过n / 4次的所有元素 [英] Find all elements that appear more than n/4 times in linear time

查看:87
本文介绍了查找线性时间中出现次数超过n / 4次的所有元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题是Skiena的4-11。查找多数元素的解决方案-重复超过一半以上是多数算法。我们可以使用它来查找重复n / 4次的所有数字吗?

This problem is 4-11 of Skiena. The solution to finding majority elements - repeated more than half times is majority algorithm. Can we use this to find all numbers repeated n/4 times?

推荐答案

Misra和Gries 描述了几种方法。我不完全理解他们的论文,但是一个主要的想法是用皮包

Misra and Gries describe a couple approaches. I don't entirely understand their paper, but a key idea is to use a bag.

Boyer和Moore最初的多数算法论文有很多令人费解的证据,并讨论了FORTRAN代码的形式验证,但是它很好地解释了多数算法的工作原理。关键概念始于以下想法:如果大多数元素是 A 并且您一次删除了一个 A 和其他副本,那么最终您将只有 A 的副本。接下来,应该清楚的是,删除两个都不是 A 的项目只能增加 A 持有。因此,只要任何对项目都不同,就可以安全地删除它们。然后可以将这个想法具体化。从列表中取出第一项并将其粘贴在一个盒子中。取出下一个项目并将其粘贴在框中。如果他们相同,让他们两个都坐在那里。如果新的是 different ,请将其与包装盒中的物品一起扔掉。重复上述操作,直到所有项目都在包装箱中或垃圾箱中。由于该框一次只能容纳一种物品,因此可以非常有效地将其表示为一对(物品类型,数量)

Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).

找到所有可能出现次数超过 n / k 次的项很简单,但是要解释为什么会更难。基本思想是我们可以查找并销毁 k distinct 元素组,而无需进行任何更改。为什么?如果 w> n / k ,然后 w-1> (n-k)/ k 。也就是说,如果我们删除流行元素之一,并且还删除 k-1 other 元素,那么流行元素仍然受欢迎!

The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!

实现:允许只允许 k-1 个。每当您看到出现一组 k 不同项时(即,有 k-1 的类型,而到达的类型与它们中的任何一个都不匹配),则将每种类型的一种(包括刚到达的一种)扔到垃圾箱中。我们应该为盒子使用什么数据结构?好吧,一个袋子,当然!正如Misra和Gries解释的那样,如果元素可以排序,则具有O(log k)基本操作的基于树的包将使整个算法的复杂度为O(n log k)。需要注意的一点是,删除每个元素之一的操作非常昂贵(我认为O(k log k)),但是费用会随着这些元素的到达而摊销,因此没什么大不了的。当然,如果您的元素是可哈希的,而不是可排序的,则可以使用基于哈希的包,这在某些常见的假设下会提供更好的渐近性能(但不能保证)。如果您的元素是从一个小的有限集中绘制的,则可以保证。如果只能比较它们的平等程度,那么您的书包会贵很多,我很确定您最终会得到类似O(nk)的东西。

Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is expensive (I think O(k log k)), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.

这篇关于查找线性时间中出现次数超过n / 4次的所有元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆