Finding the k largest elements of a very large file (when k is also very large)


Problem description

Let's assume that we have a very large file containing billions of integers, and we want to find the k largest of these values.

The tricky part is that k itself is also very large, which means we cannot keep k elements in memory (for example, the file has 100 billion elements and we want to find the 10 billion largest).

How can we do this in O(n)?

My idea:

We start reading the file and check each element against a second file that keeps the k largest elements seen so far (sorted in increasing order). If the element read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity would be O(n log k) if we have random access to that file; otherwise it would be O(nk).
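The replace-the-smallest logic in this idea is the same one a min-heap implements. The following is only a minimal in-memory sketch of it: it assumes the k kept elements fit in memory, which is exactly what the question rules out, so it illustrates where the O(n log k) bound comes from and is not a solution to the external case. The function name `k_largest_in_memory` is made up for this example.

```python
import heapq
from typing import Iterable, List


def k_largest_in_memory(values: Iterable[int], k: int) -> List[int]:
    """Return the k largest values in increasing order, using a min-heap of size k."""
    if k <= 0:
        return []
    heap: List[int] = []
    for value in values:
        if len(heap) < k:
            # Still filling the "second file": keep everything.
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # heap[0] plays the role of the first line of the second file
            # (the smallest of the k kept elements); replace it in O(log k).
            heapq.heapreplace(heap, value)
    return sorted(heap)
```

Each of the n elements costs at most one O(log k) heap operation, which gives the O(n log k) total mentioned above.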

Any idea how to do this in O(n)? I guess that with an external version of the selection algorithm (the partitioning algorithm used in quicksort) we could do this in O(n), but I couldn't find one anywhere.

Recommended answer

PS: My definition of K is different. It is a smallish number, say 2 or 100 or 1000. Here m corresponds to the OP's definition of k. Sorry about this.

It depends on how many reads of the original data you can do and how much extra space you have. This approach assumes you have extra space equivalent to the original data.

Step 1: Pick K random numbers from across the whole data.
Step 2: Sort the K numbers (assume they are indexed from 1 to K).
Step 3: Create K+1 separate files and name them 0 to K.
Step 4: For every element in the data, if it lies between the ith and (i+1)th sampled numbers, put it in the ith file (file 0 takes everything below the 1st sampled number, file K everything above the Kth).
Step 5: Based on the size of each file, choose the file that contains the mth number.
Step 6: Repeat everything with the new file and the new m (new_m = m - sum_of_size_of_all_lower_files).
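One pass of Steps 1 to 4 could be sketched roughly as follows. This is an illustration under assumptions not stated in the answer: the data is a text file with one integer per line and no blank lines, the random numbers are picked by reservoir sampling in a first read, and values equal to a sampled number go to the lower file. The function name `partition_into_buckets` and the `bucket_<i>.txt` file names are made up for this example.

```python
import bisect
import os
import random
from typing import List


def partition_into_buckets(src_path: str, num_pivots: int, out_dir: str,
                           seed: int = 0) -> List[str]:
    """Split src_path into num_pivots + 1 files ordered from lowest to highest range."""
    rng = random.Random(seed)

    # Step 1: reservoir-sample num_pivots values in a single read.
    pivots: List[int] = []
    with open(src_path) as src:
        for seen, line in enumerate(src):
            value = int(line)
            if len(pivots) < num_pivots:
                pivots.append(value)
            else:
                j = rng.randint(0, seen)  # inclusive on both ends
                if j < num_pivots:
                    pivots[j] = value

    # Step 2: sort the sampled numbers.
    pivots.sort()

    # Step 3: create files 0..K.
    paths = [os.path.join(out_dir, f"bucket_{i}.txt") for i in range(num_pivots + 1)]
    buckets = [open(path, "w") for path in paths]
    try:
        # Step 4: second read; binary search finds the file for each value.
        with open(src_path) as src:
            for line in src:
                value = int(line)
                index = bisect.bisect_left(pivots, value)
                buckets[index].write(f"{value}\n")
    finally:
        for bucket in buckets:
            bucket.close()
    return paths
```

Only the K sampled numbers and K+1 open file handles are held in memory, while the files together take as much space as the input, which matches the extra-space assumption above. Note that if many values are identical, a file may not shrink between rounds, so a real implementation would need to handle ties separately.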

Regarding the last step: if K=2, m=1000, and file 0 has size 800, file 1 has size 900, and file 2 has size 200, then new_m = m - 800 = 200, and you work through file 1 iteratively.
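The bookkeeping in Steps 5 and 6 can be checked with a few lines of code. This sketch assumes m is a 1-based rank counted from the smallest element, which is how the numbers in the example work out; under that reading, the k-th largest of n elements would be the one with m = n - k + 1. The function name `choose_bucket` is made up for this example.

```python
from typing import Sequence, Tuple


def choose_bucket(sizes: Sequence[int], m: int) -> Tuple[int, int]:
    """Return (file_index, new_m) for the m-th smallest element (1-based rank)."""
    if m < 1 or m > sum(sizes):
        raise ValueError("m must be between 1 and the total number of elements")
    lower = 0  # sum_of_size_of_all_lower_files so far
    for index, size in enumerate(sizes):
        if m <= lower + size:
            return index, m - lower
        lower += size
    raise AssertionError("unreachable: m was validated against the total size")


print(choose_bucket([800, 900, 200], 1000))  # (1, 200)
```

Files 0 and 1 together hold the 1,700 smallest elements, so rank 1000 falls in file 1 at local rank 200, as in the example.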
