What algorithm to use to delete duplicates?

Problem description

Imagine that we have some file, called, for example, "A.txt". We know that there are some duplicate elements. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of B will be approximately equal to the size of A, sometimes it will be many times smaller than A. Let it have a structure like this:

a 1
b 2
c 445
a 1

We need to get a file "B.txt" that will not have such duplicates. For example, it should be like this:

a 1
b 2
c 445

I thought about an algorithm that copies A into B, then takes the first string in B and compares it against every other one, deleting any duplicates it finds. Then it takes the second string, and so on.

But I think it is way too slow. What can I use?
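For reference, one way to read the approach described above is the following tiny in-memory sketch (hypothetical; with a 50 GB file every membership check would itself be a scan over data on disk, which is what makes this quadratic and "way too slow"):

def naive_dedup(lines):
    out = []
    for line in lines:        # for each string, look for it among the
        if line not in out:   # ones already kept: O(n^2) comparisons
            out.append(line)
    return out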

A is not a database! No SQL, please.

Sorry, I didn't say this before: sorting is OK.

Although sorting is possible, what if it could not be sorted?

Recommended answer

Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.
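To put rough numbers on that comparison, here is the arithmetic using the figures from the paragraph above (k = 10 is an assumption, based on the file being about ten times bigger than memory):

import math

n = 5e9   # lines, from the example above (5G lines of ~10 characters)
k = 10    # assumed: file is ~10x memory, so at most ~10 passes

print(math.log(n))      # ~22.3, the ln(n) quoted above
print(n * math.log(n))  # ~1.1e11: comparisons for an O(n ln n) sort
print(k * n)            # ~5.0e10: worst-case line visits for k passes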

Process:

  1. Allocate a few megabytes for an input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.
  2. Repeat: fill I from F and process it into R, via step 3.
  3. For each line L in I, check whether L is already in H and R. If so, go on to the next L; otherwise add L to R and its hash to H.
  4. When R is full, say with M entries, write it to O. Then repeatedly fill I from F, dedup as in step 3, and write to O. At EOF(F), go to step 5.
  5. Repeat (using the old O as input F and a new O for output): read M lines from F and copy them to O. Then load R and H as in steps 2 and 3, and copy to EOF(F) with dedup as before. Set M to the new number of non-dupped lines at the beginning of each O file.

Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.
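Here is a minimal sketch of that process in Python, under some simplifying assumptions: each record is a single text line, one Python set stands in for both R and H (the answer keeps the lines and their hashes separate), and MAX_KEEP is a made-up knob for how many distinct lines fit in memory, i.e. the "1/k" budget:

import os

MAX_KEEP = 100_000_000  # assumed capacity of R; tune to ~1/k of memory

def dedup_pass(in_path, out_path, skip):
    """One pass (steps 2-5): copy the first `skip` already-deduped lines,
    dedup as many new lines as fit in memory, and pass the rest through
    (dropping only duplicates of lines currently held in memory)."""
    seen = set()   # plays the role of R and H together
    kept = 0
    with open(in_path) as f, open(out_path, "w") as out:
        for i, line in enumerate(f):
            if i < skip:                 # leading M lines: already unique
                out.write(line)
            elif kept < MAX_KEEP:
                if line not in seen:     # step 3: first sighting, keep it
                    seen.add(line)
                    out.write(line)
                    kept += 1
            elif line not in seen:       # R is full: defer unknown lines
                out.write(line)          # to a later pass, drop known dups
    # If R never filled, every remaining line was checked: we are done.
    return skip + kept, kept < MAX_KEEP

def dedup_file(a_path, b_path):
    """Repeat passes, alternating temp files, until a pass fits in memory."""
    tmps = [b_path + ".tmp0", b_path + ".tmp1"]
    src, skip, done, p = a_path, 0, False, 0
    while not done:
        dst = tmps[p % 2]
        skip, done = dedup_pass(src, dst, skip)
        if src != a_path:
            os.remove(src)
        src, p = dst, p + 1
    os.replace(src, b_path)

Each pass preserves the invariant from the note above: the first `skip` lines of the output are all distinct, and nothing after them duplicates any of them, so every pass makes at least MAX_KEEP lines' worth of progress.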

Update 1 Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
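A minimal sketch of that variant, under the same assumptions as the sketch above (and reusing its MAX_KEEP): each pass appends its deduped block directly to the result file P, and only the leftover tail is rewritten and re-read on later passes.

import os

def dedup_file_p(a_path, b_path):
    """Update-1 variant: append each pass's deduped block straight to
    b_path (the file P) and carry only the leftover tail into the next
    pass, so finished lines are never read or written again."""
    open(b_path, "w").close()    # P starts empty; appended to each pass
    src, p = a_path, 0
    while True:
        seen, kept = set(), 0
        tail_path = b_path + f".tail{p % 2}"   # leftovers for next pass
        with open(src) as f, \
             open(b_path, "a") as out, \
             open(tail_path, "w") as tail:
            for line in f:
                if kept < MAX_KEEP:
                    if line not in seen:       # new line: goes to P
                        seen.add(line)
                        out.write(line)
                        kept += 1
                elif line not in seen:         # R full: defer to next pass
                    tail.write(line)
        if src != a_path:
            os.remove(src)
        if kept < MAX_KEEP:                    # tail is empty: finished
            os.remove(tail_path)
            return
        src, p = tail_path, p + 1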
