Calculate the median of a billion numbers

Problem description

If you have one billion numbers and one hundred computers, what is the best way to locate the median of these numbers?

One solution which I have is:

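  • Split the set equally among the computers.
  • Sort each part.
  • Find the median of each set.
  • Sort the sets by their medians.
  • Merge two sets at a time, from the lowest median to the highest.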

If we have m1 < m2 < m3 ... then first merge Set1 and Set2 and in the resulting set we can discard all the numbers lower than the median of Set12 (merged). So at any point of time we have equal sized sets. By the way this cannot be done in a parallel manner. Any ideas?

Recommended answer

Ah, my brain has just kicked into gear, I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.

Each of the other machines sorts its data, and does so in a way which favours finding the lower values first. So for example a quicksort, always sorting the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle's algorithm on: experiment a bit).
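A minimal sketch of the "lower part first" idea, written as a Python generator that hands out values in increasing order while the rest of the data is still unsorted. The name `lazy_quicksort` is hypothetical, and for simplicity the sketch copies each partition into new lists, so it assumes O(N) extra space rather than sorting in place.

```python
import random
from typing import Iterator, List


def lazy_quicksort(values: List[int]) -> Iterator[int]:
    """Yield the elements of `values` in ascending order, smallest first.

    Hypothetical helper: the lower partition is always processed before the
    upper one, so the smallest values are available long before the whole
    input has been sorted.
    """
    # Each stack entry is (chunk, already_sorted). The entry on top of the
    # stack always holds the smallest values that have not been yielded yet.
    stack = [(list(values), False)]
    while stack:
        chunk, already_sorted = stack.pop()
        if already_sorted or len(chunk) <= 1:
            yield from chunk
            continue
        pivot = random.choice(chunk)
        lower = [v for v in chunk if v < pivot]
        equal = [v for v in chunk if v == pivot]
        upper = [v for v in chunk if v > pivot]
        # Push in reverse order so that `lower` is popped (and sorted) first.
        stack.append((upper, False))
        stack.append((equal, True))
        stack.append((lower, False))
```

A sorting machine could iterate over `lazy_quicksort(parcel)` and write each value to the control machine as it appears, instead of waiting for a full sort to finish.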

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 500,000,000th and 500,000,001st values.
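The counting merge can be sketched in Python roughly as follows. This is an illustration only: `median_from_sorted_streams` is a hypothetical name, each stream stands in for the ascending data arriving from one sorting machine, and the total number of values is assumed to be known in advance.

```python
import heapq
from typing import Iterable, List


def median_from_sorted_streams(streams: List[Iterable[float]], total: int) -> float:
    """Merge ascending streams, counting values until the median rank is reached.

    `streams` and `total` are assumptions of this sketch: one ascending
    iterable per sorting machine, and the combined number of values.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    lo_rank = (total - 1) // 2  # 0-based rank of the lower middle value
    hi_rank = total // 2        # 0-based rank of the upper middle value
    lo_value = None
    # heapq.merge is lazy: it only pulls as many values as we consume, and
    # nothing that has been merged is stored.
    for seen, value in enumerate(heapq.merge(*streams)):
        if seen == lo_rank:
            lo_value = value
        if seen == hi_rank:
            return (lo_value + value) / 2
    raise ValueError("streams held fewer than `total` values")
```

For an odd `total` the two ranks coincide, so the function returns the single middle value. The merge stops as soon as the upper middle value has been seen, so roughly half of the data is never read.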

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bound, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1,2,3,.. 100", perhaps a sorting machine could send a message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.
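One possible shape for such a guessing game, offered only as a sketch and not as the scheme the answer has in mind: the control machine guesses a value, every sorting machine replies with how many of its values are at most that guess, and the control machine bisects on the guess. The class `SortingMachine` and the functions below are hypothetical, and the sketch assumes integer data and that no machine is empty.

```python
from bisect import bisect_right
from typing import List, Tuple


class SortingMachine:
    """Hypothetical stand-in for one remote machine holding sorted integers."""

    def __init__(self, values: List[int]) -> None:
        self._sorted = sorted(values)

    def bounds(self) -> Tuple[int, int]:
        """Smallest and largest local value (assumes the machine is not empty)."""
        return self._sorted[0], self._sorted[-1]

    def count_at_most(self, guess: int) -> int:
        """Response to a challenge: how many local values are <= guess."""
        return bisect_right(self._sorted, guess)


def kth_smallest(machines: List[SortingMachine], k: int) -> int:
    """Value of 0-based rank k across all machines, found by bisecting on the value."""
    lo = min(m.bounds()[0] for m in machines)
    hi = max(m.bounds()[1] for m in machines)
    while lo < hi:
        guess = (lo + hi) // 2
        # One round trip: every machine answers with a single count.
        if sum(m.count_at_most(guess) for m in machines) > k:
            hi = guess       # the k-th value is at most the guess
        else:
            lo = guess + 1   # the k-th value is above the guess
    return lo


def median(machines: List[SortingMachine], total: int) -> float:
    """Mean of the two middle values (they coincide when `total` is odd)."""
    lower = kth_smallest(machines, (total - 1) // 2)
    upper = kth_smallest(machines, total // 2)
    return (lower + upper) / 2
```

Each round sends one number out and gets one count back per machine, so very little data crosses the network, at the cost of about log2(max - min) round trips per selection.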

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.
