How to find duplicate files in large filesystem whilst avoiding MemoryError

Problem description

I am trying to avoid duplicates in my mp3 collection (quite large). I want to check for duplicates by checking file contents, instead of looking for same file name. I have written the code below to do this but it throws a MemoryError after about a minute. Any suggestions on how I can get this to work?

import os
import hashlib

walk = os.walk('H:\MUSIC NEXT GEN')

mySet = set()
dupe  = []

hasher = hashlib.md5()

for dirpath, subdirs, files in walk:
    for f in files:
        fileName =  os.path.join(dirpath, f)
        with open(fileName, 'rb') as mp3:
            buf = mp3.read()
            hasher.update(buf)
            hashKey = hasher.hexdigest()
            print hashKey
            if hashKey in mySet:
                dupe.append(fileName)
            else:
                mySet.add(hashKey)


print 'Dupes: ' + str(dupe)

Solution

You probably have a huge file that can't be read all at once the way you try with mp3.read(). Read it in smaller parts instead. Putting the hashing into a nice little function also helps keep your main program clean. Here's a function I've been using myself for a while now (just slightly polished here) for a tool that's probably similar to yours:

import hashlib

def filehash(filename):
    # Hash the file in 1 MiB chunks so that only one chunk is held in memory at a time.
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                # End of file: return the digest of everything read so far.
                return hasher.hexdigest()
            hasher.update(buffer)

Update: A readinto version:

# Preallocate a single 1 MiB buffer that is reused for every call.
buffer = bytearray(1 << 20)

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            # readinto() fills the existing buffer and returns the number of bytes read.
            n = file.readinto(buffer)
            if not n:
                return hasher.hexdigest()
            # Only hash the bytes actually read when the final chunk is partial.
            hasher.update(buffer if n == len(buffer) else buffer[:n])

With a 1GB file already cached in memory and ten attempts, this took on average 5.35 seconds. The read version took on average 6.07 seconds. In both versions, the Python process occupied about 10MB of RAM during the run.

I'll probably stick with the read version, as I prefer its simplicity and because in my real use cases, the data isn't already cached in RAM and I use sha256 (so the overall time goes up significantly and makes the little advantage of readinto even more irrelevant).
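
For completeness, here is a minimal sketch, not part of the original answer, of how filehash could be wired into the walk from the question. The find_dupes name and the dict-of-lists grouping are illustrative choices, and the directory path is simply the one from the question:

import os
import hashlib

def filehash(filename):
    # The chunked-read version from the answer above.
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                return hasher.hexdigest()
            hasher.update(buffer)

def find_dupes(root):
    # Map each digest to the list of files that produced it; any digest with
    # more than one file is a group of duplicates.
    seen = {}
    for dirpath, subdirs, files in os.walk(root):
        for f in files:
            path = os.path.join(dirpath, f)
            seen.setdefault(filehash(path), []).append(path)
    return [paths for paths in seen.values() if len(paths) > 1]

for group in find_dupes(r'H:\MUSIC NEXT GEN'):
    print(group)

Note that, unlike the loop in the question, this hashes each file with a fresh hasher created inside filehash, so every file gets an independent digest rather than a cumulative one.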
