How to diff two files using a Python generator
Question
I have a 100 GB file containing the numbers 1 to 1000000000000, one per line. Some lines are missing, for example 5, 11, and 19919. My machine has 8 GB of RAM.
How can I find the missing elements?
My idea is to use for i in range(1, 1000000000000) as a stand-in for the second file and read the lines one at a time with a generator. Can the yield statement be used for this?
Could someone help me write the code?
My code is below; it reads each file into memory as a list. Can it be used in production?
def difference(a, b):
    with open(a, 'r') as f:
        aunique = set(f.readlines())
    with open(b, 'r') as f:
        bunique = set(f.readlines())
    with open('c', 'a+') as f:
        for line in list(bunique - aunique):
            f.write(line)
Recommended answer
You can iterate over all the numbers generated by range and compare each one to the next number read from the file. If they differ, the number is missing, so output it; if they match, read the next number from the file for the next comparison:
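with open('numbers') as f:
    # Start at 0 so the first loop iteration reads the first number from the file.
    next_number = 0
    for n in range(1000000000001):
        if n == next_number:
            # n is present: advance to the next number in the file.
            # At end of file, next() returns the default 0, which never matches again.
            next_number = int(next(f, 0))
        else:
            # n is not in the file, so it is missing.
            print(n)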
Demo (assuming target numbers from 1 to 10 instead): https://repl.it/repls/WaterloggedUntimelyCoding
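Since the question asks specifically about yield, the same comparison can also be wrapped in a generator function. The following is a minimal sketch, not part of the original answer: the name missing_numbers and the parameter upper are hypothetical, and it assumes the file lists its numbers in ascending order, one per line. Only one line is held in memory at a time, so the 8 GB RAM limit should not matter.

```python
import io
from typing import Iterable, Iterator


def missing_numbers(lines: Iterable[str], upper: int) -> Iterator[int]:
    """Yield each integer in 1..upper that does not appear in `lines`.

    Assumption: `lines` yields the numbers in ascending order, one per line.
    """
    expected = 1  # smallest number not yet accounted for
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        present = int(line)
        # Every number from `expected` up to (not including) `present` is missing.
        yield from range(expected, min(present, upper + 1))
        expected = max(expected, present + 1)
    # Anything after the last number in the file is missing as well.
    yield from range(expected, upper + 1)


if __name__ == "__main__":
    # Small in-memory stand-in for the real file: 1 to 10 with 5 missing.
    sample = io.StringIO("1\n2\n3\n4\n6\n7\n8\n9\n10\n")
    print(list(missing_numbers(sample, 10)))
```

For the real file, an open file object can be passed directly, since iterating over it reads one line at a time: open 'numbers', loop over missing_numbers(f, 1000000000000), and print or write each value as it is produced instead of collecting the results in a list.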