Algorithm for detecting duplicates in a dataset which is too large to be completely loaded into memory


Question

Is there an optimal solution to this problem?

Describe an algorithm for finding duplicates in a file of one million phone numbers. The algorithm, when running, would only have two megabytes of memory available to it, which means you cannot load all the phone numbers into memory at once.

My 'naive' solution would be an O(n^2) approach which iterates over the values and loads the file in chunks rather than all at once:

For i = 0 to 999,999
  string currentVal = get the item at index i
  for j = i+1 to 999,999
    if ((j - i) mod fileChunkSize == 0)
      load the next file chunk into data
    if data[j mod fileChunkSize] == currentVal
      add currentVal to duplicateList and exit the inner loop

There must be another approach where you can process the whole dataset in a genuinely different way and verify whether a number is duplicated. Anyone have one?

Answer

Divide the file into m chunks, each of which is small enough to be sorted in memory. Sort them in memory.
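
For concreteness, here is a minimal Python sketch of this chunking pass, assuming the phone numbers sit one per line in a text file; the helper name, chunk size, and temp-file naming are my own choices, not part of the answer:

CHUNK_SIZE = 100_000  # numbers per run; tune so a sorted chunk fits the 2 MB budget

def write_sorted_chunks(path):
    """Split the input file into sorted runs on disk; return their paths."""
    chunk_paths = []
    with open(path) as f:
        while True:
            # read up to CHUNK_SIZE lines; zip stops when either side runs out
            chunk = [line.strip() for _, line in zip(range(CHUNK_SIZE), f)]
            if not chunk:
                break
            chunk.sort()  # lexicographic order is consistent for equal-length numbers
            out_path = "chunk_%d.txt" % len(chunk_paths)
            with open(out_path, "w") as out:
                out.write("\n".join(chunk) + "\n")
            chunk_paths.append(out_path)
    return chunk_paths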

For each pair of chunks, we then carry out the last step of mergesort on the two chunks to make one larger chunk: (c_1 + c_2), (c_3 + c_4), ..., (c_(m-1) + c_m).

Point at the first element of c_1 and c_2 on disk, and make a new file (we'll call it c_1+2).

If c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1. Otherwise, copy c_2's pointed-to element into c_1+2 and point to the next element of c_2.

Repeat the previous step until both chunks are empty. You only need enough memory to hold the two pointed-to numbers. During this process, if you find that c_1's and c_2's pointed-to elements are equal, you have found a duplicate; you can copy it in twice and increment both pointers.
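
A sketch of this merge step, continuing the assumptions above (sorted runs on disk, one number per line); reporting duplicates through a set passed in by the caller is my own convention:

def merge_pair(path_a, path_b, out_path, duplicates):
    """Merge two sorted runs into one, recording values seen at both heads."""
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        x, y = a.readline(), b.readline()
        while x and y:
            if x < y:
                out.write(x)
                x = a.readline()
            elif y < x:
                out.write(y)
                y = b.readline()
            else:
                # equal pointed-to elements: a duplicate spanning the two runs
                duplicates.add(x.strip())
                out.write(x)
                out.write(y)
                x, y = a.readline(), b.readline()
        while x:  # drain whichever run is not yet empty
            out.write(x)
            x = a.readline()
        while y:
            out.write(y)
            y = b.readline()

Note that equal values also end up adjacent in the merged output, so a final linear scan of the fully merged file will additionally catch duplicates that started out inside the same chunk.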

The resulting m/2 chunks can be recursively merged in the same manner; it will take log(m) of these merge passes to generate the final sorted array. Each number will be compared against each other number in a way that finds the duplicates.
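
Putting the pieces together, a hypothetical driver for the log(m) merge passes might look like this; the final adjacent-value scan is a small addition on top of the answer that also catches duplicates which began life in the same chunk:

def find_duplicates(path):
    duplicates = set()  # kept in memory for simplicity in this sketch
    runs = write_sorted_chunks(path)
    pass_no = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs) - 1, 2):
            out_path = "merge_%d_%d.txt" % (pass_no, i // 2)
            merge_pair(runs[i], runs[i + 1], out_path, duplicates)
            merged.append(out_path)
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out: carry it to the next pass
        runs = merged
        pass_no += 1
    if runs:  # guard against an empty input file
        prev = None
        with open(runs[0]) as f:  # everything equal is now adjacent
            for line in f:
                if line == prev:
                    duplicates.add(line.strip())
                prev = line
    return duplicates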

Alternately, a quick and dirty solution, as alluded to by @Evgeny Kluev, is to make a Bloom filter which is as large as you can reasonably fit in memory. You can then make a list of each element that the Bloom filter reports as possibly seen already (a true duplicate or a false positive), and loop through the file a second time in order to test these members for duplication.
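
A rough sketch of that two-pass idea, with a small hand-rolled Bloom filter; the bit-array size, the double-hashing scheme, and collecting flagged values rather than indices are all my own assumptions:

import hashlib

M_BITS = 8 * 1024 * 1024  # a 1 MB bit array, inside the 2 MB budget
K_HASHES = 4

def _bit_positions(value):
    """Derive K_HASHES bit positions by double-hashing an MD5 digest."""
    digest = hashlib.md5(value.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:], "big")
    return [(h1 + i * h2) % M_BITS for i in range(K_HASHES)]

def find_duplicates_bloom(path):
    bits = bytearray(M_BITS // 8)
    flagged = set()  # assumes few enough flagged values to hold in memory
    with open(path) as f:
        for line in f:
            v = line.strip()
            pos = _bit_positions(v)
            if all(bits[p // 8] >> (p % 8) & 1 for p in pos):
                flagged.add(v)  # maybe a duplicate, maybe a false positive
            for p in pos:
                bits[p // 8] |= 1 << (p % 8)
    # second pass: count how often each flagged value really occurs
    counts = dict.fromkeys(flagged, 0)
    with open(path) as f:
        for line in f:
            v = line.strip()
            if v in counts:
                counts[v] += 1
    return {v for v, n in counts.items() if n > 1}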

