查找所有在线性时间内出现次数超过 n/4 次的元素 [英] Find all elements that appear more than n/4 times in linear time

查看:17
本文介绍了查找所有在线性时间内出现次数超过 n/4 次的元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题是Skiena的4-11.找到多数元素的解决方案 - 重复超过半次是多数算法.我们可以用它来找出所有重复 n/4 次的数字吗?

This problem is 4-11 of Skiena. The solution to finding majority elements - repeated more than half times is majority algorithm. Can we use this to find all numbers repeated n/4 times?

推荐答案

Misra 和 Gries 描述了几种方法.我不完全理解他们的论文,但一个关键的想法是使用袋子.

Misra and Gries describe a couple approaches. I don't entirely understand their paper, but a key idea is to use a bag.

Boyer 和 Moore 的原始多数算法论文有关于 FORTRAN 代码的形式验证的许多难以理解的证明和讨论,但它有一个很好的开始解释多数算法的工作原理.关键概念始于这样一个想法,如果大多数元素是 A 并且您一次删除一个 A 的副本和其他内容的副本,那么最后你将只有 A 的副本.接下来,应该清楚的是,删除两个不同的项目,都不是A,只能增加A 持有的多数.因此,删除任何 对项目都是安全的,只要它们不同即可.然后可以将这个想法具体化.从列表中取出第一项并将其粘贴在一个盒子中.取出下一件物品并将其贴在盒子里.如果他们是一样的,让他们都坐在那里.如果新的不同,把它连同盒子里的一件东西一起扔掉.重复直到所有物品都在盒子里或垃圾桶里.由于盒子一次只允许有一种物品,它可以非常有效地表示为一对(item type, count).

Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).

找出所有可能出现次数超过 n/k 次的项目的概括很简单,但解释它为什么起作用有点困难.基本思想是我们可以在不改变任何东西的情况下找到并销毁 k distinct 元素的组.为什么?如果 w >n/k 然后 w-1 >(n-k)/k.也就是说,如果我们去掉其中一个流行元素,同时去掉k-1 other元素,那么这个流行元素依然流行!

The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!

实现:不是只允许 一种 类型的物品在框中,而是允许 k-1 种物品.每当您看到一组 k 不同 项出现时(即,框中有 k-1 种类型,到达的那个不匹配任何一种),您将每种类型中的一种扔进垃圾箱,包括刚刚到达的那种.我们应该为这个盒子"使用什么数据结构?嗯,当然是包!正如 Misra 和 Gries 所解释的那样,如果元素可以排序,那么具有 O(log k) 基本操作的基于树的袋子将使整个算法的复杂度为 O(n log k).需要注意的一点是,删除每个元素中的一个的操作是昂贵的(我认为 O(k log k)),但是该成本在这些元素的到达时分摊,因此没什么大不了的.当然,如果您的元素是可散列的而不是可排序的,您可以改用基于散列的包,在某些常见假设下,它会提供更好的渐近性能(但不能保证).如果您的元素是从一个小的有限集合中提取的,您可以保证这一点.如果它们只能比较相等,那么你的包会变得更贵,我很确定你最终会得到类似 O(nk) 的东西.

Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is expensive (I think O(k log k)), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.

这篇关于查找所有在线性时间内出现次数超过 n/4 次的元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆