在一个文件中查找不在python中另一个文件中的所有数字 [英] Find all the numbers in one file that are not in another file in python

查看:80
本文介绍了在一个文件中查找不在python中另一个文件中的所有数字的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有两个文件,分别是FileA和FileB,我们需要找到FileA中所有的数字,而FileB中没有这些数字. FileA中的所有数字均已排序,FileB中的所有数字均已排序.例如,

There are two files, say FileA and FileB and we need to find all the numbers that are in FileA which is not there in FileB. All the numbers in the FileA are sorted and all the numbers in FileB are sorted. For example,

输入:

FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]

输出:

[2, 5, ...]

内存非常有限,甚至无法一次将一个完整的文件加载到内存中.还需要线性或更短的时间复杂度.

The memory is very limited and even one entire file cannot be loaded into memory at a time. Also linear or lesser time complexity is needed.

因此,如果文件足够小以适合内存,我们可以加载它们并将其内容初始化为两个集合,然后取一个集合差,以便以O(1)或恒定时间复杂度解决问题. /p>

So if the files are small enough to fit in the memory, we could load them and initialize its contents as two sets and then take a set difference so that the problem is solved in O(1) or constant time complexity.

set(contentsofFileA)-set(contentsofFileB)

但是由于文件太大,它们将无法完全加载到内存中,因此这是不可能的.

But since the files are so big, they won't be able to load entirely into the memory and so this is not possible.

另外,另一种方法是在批处理中使用蛮力方法.因此,我们从FileA加载一个数据块或一批数据,然后从FileB加载数据块,然后比较它,然后再从FileB加载下一个数据块,依此类推.然后,在检查了FileB中的所有元素之后,检查了FileA块,然后从FileA加载下一批,然后继续进行.但这会产生O(n ^ 2)或二次时间复杂度,对于具有大条目的超大文件来说效率不高.

Also, another approach would be to use a brute force method with batch processing. So, we load a chunk or batch of data from FileA and then a batch from FileB and then compare it and then the next chunk from FileB and so on. Then after the FileA chunk is checked over all the elements in FileB then load the next batch from FileA and this continues. But this would create an O(n^2) or quadratic time complexity and not efficient for a very large file with large entries.

该问题需要以线性或更短的时间复杂度来解决,并且不将整个文件加载到内存中.有帮助吗?

The problem is required to be solved in linear or lesser time complexity and without loading the entire files into memory. Any help?

推荐答案

如果由于内存不足而需要逐行读取文件,并且需要线性解决方案,则可以使用iter文件是基于行的,否则请参见:

If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:

首先在终端中,您可以执行以下操作来生成一些测试文件:

First in your terminal you can do this to generate some test files:

seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt

然后您运行以下代码:

i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
    try:
        if a < b:
            aNotB += [a]
            a = int(next(i1, None))
        elif a > b:
            # bNotA += [a]
            b = int(next(i2, None))
        elif a == b:
            a = int(next(i1, None))
            b = int(next(i2, None))
    except TypeError:
        if not b:
            aNotB += list(i1)
            break
        else:
            # bNotA += list(i1)
            break
print(aNotB)

输出:

[3、9、15、21、27、33、39、45、51、57、63、69、75、81、87、93、99] 如果您同时希望aNotB和bNotA的结果,则可以取消注释这两行.

[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99] If you want both the result for aNotB and bNotA you can uncomment those two lines.

与Andrej Kesely的答案进行时间比较:

Timing comparison with Andrej Kesely's answer:

$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py        
python manual_iter.py  0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py        
python heapq_groupby.py  1.11s user 0.00s system 99% cpu 1.116 total

这篇关于在一个文件中查找不在python中另一个文件中的所有数字的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆