How to find duplicate files in large filesystem whilst avoiding MemoryError
Problem description
I am trying to avoid duplicates in my mp3 collection (quite large). I want to check for duplicates by checking file contents, instead of looking for same file name. I have written the code below to do this but it throws a MemoryError after about a minute. Any suggestions on how I can get this to work?
import os
import hashlib

walk = os.walk('H:\MUSIC NEXT GEN')
mySet = set()
dupe = []
hasher = hashlib.md5()

for dirpath, subdirs, files in walk:
    for f in files:
        fileName = os.path.join(dirpath, f)
        with open(fileName, 'rb') as mp3:
            buf = mp3.read()
            hasher.update(buf)
            hashKey = hasher.hexdigest()
            print hashKey
            if hashKey in mySet:
                dupe.append(fileName)
            else:
                mySet.add(hashKey)

print 'Dupes: ' + str(dupe)
You probably have a huge file that can't be read all at once like you try with mp3.read(). Read smaller parts instead. Putting it into a nice little function also helps keep your main program clean. Here's a function I've been using myself for a while now (just slightly polished it now) for a tool probably similar to yours:
import hashlib

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                return hasher.hexdigest()
            hasher.update(buffer)
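Note that the original code also reuses one md5 hasher for every file, so each hexdigest() actually covers the concatenation of all files read so far, and duplicates would almost never be detected; creating the hasher inside filehash fixes that as well. A minimal sketch of how the main loop could use it (the seen dictionary mapping each hash to the first path it appeared at is an addition here, so you can see what each duplicate matches):

import os

seen = {}   # hash -> first file found with that content
dupes = []

for dirpath, subdirs, files in os.walk('H:\MUSIC NEXT GEN'):
    for f in files:
        fileName = os.path.join(dirpath, f)
        hashKey = filehash(fileName)
        if hashKey in seen:
            dupes.append((fileName, seen[hashKey]))  # duplicate and its original
        else:
            seen[hashKey] = fileName

print 'Dupes: ' + str(dupes)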
Update: a readinto version:
buffer = bytearray(1 << 20)
def filehash(filename):
with open(filename, mode='rb') as file:
hasher = hashlib.md5()
while True:
n = file.readinto(buffer)
if not n:
return hasher.hexdigest()
hasher.update(buffer if n == len(buffer) else buffer[:n])
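Two details about this version: the 1 MiB bytearray is allocated once at module level so it is reused across calls (which also means the function isn't safe to call from multiple threads at once), and buffer[:n] still copies the final partial chunk. If you want to avoid that last copy too, a memoryview slice is zero-copy and hashlib accepts it, at least on Python 3 (a sketch, not benchmarked here):

import hashlib

buffer = bytearray(1 << 20)
view = memoryview(buffer)   # zero-copy window over the reused buffer

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            n = file.readinto(buffer)
            if not n:
                return hasher.hexdigest()
            hasher.update(view[:n])   # slicing a memoryview copies nothing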
With a 1GB file already cached in memory and ten attempts, the readinto version took on average 5.35 seconds and the read version 6.07 seconds. In both versions, the Python process occupied about 10MB of RAM during the run.
I'll probably stick with the read version, as I prefer its simplicity, and because in my real use cases the data isn't already cached in RAM and I use sha256 (so the overall time goes up significantly, which makes the small advantage of readinto even more irrelevant).
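Since the real use case mentioned above uses sha256, one easy generalization (a sketch along the same lines, not part of the timing above) is to let the caller choose the algorithm through hashlib.new, keeping md5 as the default:

import hashlib

def filehash(filename, algorithm='md5'):
    with open(filename, mode='rb') as file:
        hasher = hashlib.new(algorithm)   # e.g. 'md5' or 'sha256'
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                return hasher.hexdigest()
            hasher.update(buffer)

# usage (hypothetical path): filehash('song.mp3', 'sha256')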