How can I find the intersection of two large files efficiently using Python?


Question


I have two large files. Their contents looks like this:

134430513
125296589
151963957
125296589


Each file contains an unsorted list of ids. Some ids may appear more than once in a single file.


Now I want to find the intersection of the two files, that is, the ids that appear in both.


I just read the two files into two sets, s1 and s2, and get the intersection with s1.intersection(s2). But it consumes a lot of memory and seems slow.
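The set-based approach described above can be sketched with small in-memory lists standing in for the two files (the sample ids are taken from the question; the third id in the second list is an illustrative extra):

```python
# Minimal sketch of the set-intersection approach, with lists in
# place of the two large files.
ids_a = [134430513, 125296589, 151963957, 125296589]  # duplicates allowed
ids_b = [125296589, 151963957, 999999999]

s1 = set(ids_a)   # deduplicates automatically
s2 = set(ids_b)
common = s1.intersection(s2)  # ids present in both files
```

This is O(len(a) + len(b)) in time, but both sets must fit in memory at once, which is exactly the questioner's problem.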


So is there any better or more Pythonic way to do this? And if the files contain so many ids that they cannot be read into a set with the limited memory available, what can I do?

Edit: I read the files into the two sets using a generator:

def id_gen(path):
    # Yield the first whitespace-separated field of each line as an int.
    with open(path) as f:
        for line in f:
            yield int(line.split()[0])

c1 = id_gen(path)
s1 = set(c1)


All of the ids are numeric, and the max id may be 5000000000. If I use a bitarray, it will consume even more memory.
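As a rough sanity check of the bitarray figure mentioned above (this is plain arithmetic, not a claim about the bitarray library's actual overhead):

```python
# One bit per possible id up to the stated maximum of 5,000,000,000.
max_id = 5_000_000_000
bytes_needed = max_id // 8      # bits packed into bytes
mib = bytes_needed / 2**20      # convert to MiB
```

So a single bit set covering the full id range needs roughly 600 MiB, and two of them (one per file) over a gigabyte, regardless of how few ids the files actually contain.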

Answer


Others have shown more idiomatic ways of doing this in Python, but if the size of the data really is too big, you can use the system utilities to sort each file and eliminate duplicates, then exploit the fact that a file object is an iterator returning one line at a time, doing something like:

import os

# Sort each file numerically and drop duplicates with the system sort.
os.system('sort -u -n s1.num > s1.ns')
os.system('sort -u -n s2.num > s2.ns')

with open('s1.ns') as i1, open('s2.ns') as i2:
    try:
        # Convert to int so comparisons are numeric, not lexicographic
        # (as strings, "10" < "9").
        d1 = int(next(i1))
        d2 = int(next(i2))
        while True:
            if d1 < d2:
                d1 = int(next(i1))
            elif d2 < d1:
                d2 = int(next(i2))
            else:
                print(d1)
                d1 = int(next(i1))
                d2 = int(next(i2))
    except StopIteration:
        pass


This keeps no more than one line per file in memory at a time (and the system sort should be faster than anything Python can do, since it is optimized for this one task).
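The merge step in the loop above can be factored into a generator that works on any two sorted, duplicate-free iterables, which also makes it easy to test on small inputs (the function name is illustrative, not from the original answer):

```python
def sorted_intersection(a, b):
    # Merge-style intersection of two sorted, duplicate-free iterables,
    # holding only one value from each at a time -- the same idea as the
    # file-based loop above.
    ia, ib = iter(a), iter(b)
    try:
        x, y = next(ia), next(ib)
        while True:
            if x < y:
                x = next(ia)
            elif y < x:
                y = next(ib)
            else:
                yield x
                x = next(ia)
                y = next(ib)
    except StopIteration:
        return  # either input is exhausted; no further matches possible
```

Passing the two open file objects wrapped in `map(int, ...)` would reproduce the file-based version while keeping the merge logic reusable.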
