`uniq` without sorting an immense text file?


Problem description


I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.


The file has unix line endings, and all content matches [[:print:]]. I tried the following awk script to display only unique lines:

awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt


The thought was that I'd populate an array by referencing its elements, using the contents of the file as keys, then skip lines that were already in the array. But this fails for two reasons -- firstly, because it inexplicably just doesn't work (even on small test files), and secondly because I know that my system will run out of memory before the entire set of unique lines is loaded into memory by awk.
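
(The likely reason that first script prints everything: merely referencing a[$0] in the pattern creates the element with an empty, false value, so the {next} rule never fires and no line is ever marked as seen. A minimal corrected sketch that actually records each line would be:

awk 'a[$0] {next} {a[$0]=1} 1' stupid.txt > less_stupid.txt

though it still has the memory problem just mentioned.)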

After searching, I found an answer that suggested:

awk '!x[$0]++'


And while this works on small files, it will also run out of memory before reading my entire file.


What's a better (i.e. working) solution? I'm open to just about anything, though I'm more partial to solutions in languages I know (bash & awk, hence the tags). In trying to visualize the problem, the best I've come up with would be to store an array of line checksums or MD5s rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.


Any tips would be very welcome. Telling me this is impossible would also be welcome, so that I stop trying to figure it out. :-P

Recommended answer


The awk '!x[$0]++' trick is one of the most elegant solutions to de-duplicate a file or stream without sorting. However, it is inefficient in terms of memory and unsuitable for large files, since it saves all unique lines into memory.


However, a much more efficient implementation is to save a fixed-length hash of each line in the array rather than the whole line. You can achieve this with a Perl one-liner that is quite similar to the awk script:

perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt


Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation takes 32.
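
For instance, a quick check of the two digest lengths (a throwaway one-liner, not part of the de-duplication pipeline):

perl -MDigest::MD5=md5_hex,md5_base64 -e 'print length(md5_hex("x")), " ", length(md5_base64("x")), "\n"'
# prints "32 22"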


However, since the Perl implementation of hashes still requires around 120 bytes per key, you may quickly run out of memory for your huge file; for example, 100 million unique lines would already need on the order of 12 GB just for the keys.


The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:

cat huge.txt | pv | 
parallel --pipe --keep-order --block 100M -j4 -q \
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt


The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. An important argument here is --keep-order, since you want the unique lines in the output to remain in the same order. I have included pv in the pipeline to get some nice statistics while the long-running process is executing.


In a benchmark I performed on a 1GB file of random data, I reached a throughput of 130MB/sec with the above settings, meaning you could de-duplicate your 40GB file in roughly five minutes (if you have a hard disk fast enough to write at this rate).
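
If GNU Parallel is not available, the manual splitting mentioned above could look roughly like this (a sketch assuming GNU split; the chunk size and file names are illustrative, and duplicates that span chunk boundaries will survive, which is acceptable when duplicate lines appear close together):

# split into ~100MB pieces without breaking lines, de-duplicate each piece, concatenate in order
split -C 100M huge.txt chunk_
for f in chunk_*; do
  perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' "$f"
done > uniq.txt
rm chunk_*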

Other options include:

  • Use an efficient trie structure to store keys and check for duplicates. For example, marisa-trie is a very efficient implementation, written in C++ with Python wrappers.
  • Sort your huge file with an external merge sort or a distribution/bucket sort (see the sketch after this list).
  • Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or, more efficiently, the MD5 sums of your lines.
  • Or use Bloom filters, as in the example below.
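
For the external-sort option, GNU sort already performs an on-disk merge sort, so a sketch could be as simple as the following if the output does not have to keep the original line order (the buffer size and temporary directory are illustrative):

LC_ALL=C sort -u -S 4G -T /tmp huge.txt > uniq_sorted.txt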

Here is an example using the Bloom::Faster Perl module:

perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt

You can install Bloom::Faster from CPAN (run sudo cpan and then install Bloom::Faster).

Explanation:


  • You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines, then you will need 100 million buckets and around 260MB of memory.
  • The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
  • You can estimate the number of unique lines in your file by parsing a small section of it with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (the first 400MB) and multiplying that number by 100 (for the 40GB file). Then set the n option a little higher to be on the safe side (a worked example follows this list).
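
As a hypothetical worked example of the last point (the sample count below is made up): if the 400MB sample reports about 1,000,000 unique lines, the estimate for the whole 40GB file is 1,000,000 * 100 = 100 million unique lines, so choosing n somewhat higher, say 120 million buckets, would cost roughly 120,000,000 * 2.5 bytes, or about 300MB of memory:

dd if=huge.txt bs=400M count=1 2>/dev/null | awk '!a[$0]++' | wc -l
# suppose this prints ~1000000  ->  estimate ~100 million unique lines in 40GB
# n => 120000000 at ~2.5 bytes per bucket  ->  about 300MB of memory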


In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.
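
A hedged sketch of that combination, simply reusing the two commands shown above (each parallel job builds its own filter and only sees its own chunks, so this again relies on duplicates being close together, and each job needs memory for its own filter):

cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q \
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' > uniq.txt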
