Remove All Duplicates In A Large Text File


Problem Description

I am really stumped by this problem, and as a result I have stopped working on it for a while. I work with really large amounts of data: I get approximately 200 GB of .txt data every week, and it can run up to 500 million lines. A lot of those lines are duplicates; I would guess only about 20 GB is unique. I have had several custom programs made, including hash-based and external (on-disk) duplicate removers, but none seem to work. The latest one used a temporary database but took several days to remove the data.

The problem with all of these programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered here before, and I have spent the last 3 hours reading about 50 threads here, but none of them seem to have the same problem as me, i.e. huge datasets.

Can anyone recommend anything for me? It needs to be super accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.

Solution

The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 GB machine.

By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.

With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
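If it helps, that whole step can be driven from a short script. Below is a minimal Python sketch that shells out to GNU sort with --unique and points its temporary files at a directory on a separate physical disk via -T; the paths are placeholders of mine, not part of the original answer.

import subprocess

# Run GNU sort with --unique; -T keeps its temporary files on a separate
# physical disk, as suggested above. All paths here are placeholders.
with open("/data/output.txt", "wb") as out:
    subprocess.run(
        ["sort", "--unique", "-T", "/tempdisk/sort-tmp", "/data/input.txt"],
        stdout=out,
        check=True,
    )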

There might also be a commercial (i.e. paid) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundred dollars, you're probably money ahead if your time is worth anything.

If you can't use a ready-made program, then . . .

If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.

If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.

In pseudocode, it looks something like this:

fileNumber = 0
while not end-of-input
    load as many lines as you can into a list
    sort the list
    filename = "file"+fileNumber
    write sorted list to filename, optionally removing duplicates
    fileNumber = fileNumber + 1

You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
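For concreteness, here is a minimal Python sketch of that first pass. The chunk size, the temporary file naming, and the encoding are my assumptions, and in Python you would probably want a smaller chunk than a compiled program could handle, so that each chunk fits comfortably in your 32 GB of RAM.

import itertools

def sort_and_write(lines, file_number):
    # Sort one chunk in memory and write it out, skipping consecutive
    # duplicates (they are adjacent because the chunk is sorted).
    lines.sort()
    name = "file" + str(file_number)
    with open(name, "w", encoding="utf-8") as out:
        previous = None
        for line in lines:
            if line != previous:
                out.write(line)
                previous = line
    return name

def split_into_sorted_chunks(input_name, lines_per_chunk=20_000_000):
    # First pass: read the large file in fixed-size chunks of lines and
    # produce sorted, de-duplicated temporary files.
    temp_names = []
    with open(input_name, "r", encoding="utf-8") as f:
        file_number = 0
        while True:
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            temp_names.append(sort_and_write(chunk, file_number))
            file_number += 1
    return temp_names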

Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.

You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.

Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
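As a concrete illustration, a Python version of that merge could look roughly like the following; the function name and the use of the file index as a tie-breaker are mine, not from the original code.

import heapq

def merge_unique(temp_names, output_name):
    # k-way merge of the sorted temporary files using a priority queue (heap).
    # Each heap entry is (line, file_index); the index breaks ties on equal lines.
    files = [open(name, "r", encoding="utf-8") for name in temp_names]
    heap = []
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, i))

    previous = None
    with open(output_name, "w", encoding="utf-8") as out:
        while heap:
            line, i = heapq.heappop(heap)
            if line != previous:  # drop duplicates across files
                out.write(line)
                previous = line
            next_line = files[i].readline()
            if next_line:
                heapq.heappush(heap, (next_line, i))

    for f in files:
        f.close()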

I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).

One thing you can do to speed the program is operate on smaller chunks and be sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
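A rough sketch of that overlap, reusing the sort_and_write helper from the earlier sketch, is shown below. The structure is what the paragraph above describes, but note that in CPython the GIL limits how much of the sorting actually overlaps the reading, so the full benefit shows up in a compiled language or with a process-based worker.

from concurrent.futures import ThreadPoolExecutor
import itertools

def split_overlapped(input_name, lines_per_chunk=5_000_000):
    # Same first pass as before, but the previous chunk is sorted and written
    # by a background thread while the next chunk is read from disk.
    # sort_and_write is the helper defined in the earlier sketch.
    temp_names = []
    with open(input_name, "r", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        file_number = 0
        while True:
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            if pending is not None:
                temp_names.append(pending.result())  # wait for the previous chunk
            pending = pool.submit(sort_and_write, chunk, file_number)
            file_number += 1
        if pending is not None:
            temp_names.append(pending.result())
    return temp_names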

Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
