Finding duplicate files and removing them

This article describes how to find duplicate files and remove them. Hopefully it serves as a useful reference for anyone facing the same problem.

Problem Description

I am writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files. I am using the SHA-1 algorithm.

How can I find these duplicate files and remove them?

Solution

Recursive folders version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan all paths recursively and report all duplicates found.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                # Hash the file in chunks so large files do not have to
                # be read into memory all at once.
                with open(full_path, 'rb') as fobj:
                    for chunk in chunk_reader(fobj):
                        hashobj.update(chunk)
                # Key on (digest, size): two files count as duplicates only
                # when both the content hash and the file size match.
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print("Duplicate found: %s and %s" % (full_path, duplicate))
                else:
                    hashes[file_id] = full_path

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print("Please pass the paths to check as parameters to the script")
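To run the script, save it to a file and pass one or more directories as command-line arguments, for example (the file name and paths here are only placeholders):

    python find_duplicates.py ~/Music ~/Downloads

The script above only reports duplicates; it does not delete anything. Since the question also asks about removal, here is a minimal sketch of one way to delete the extra copies. The helper name remove_duplicates and the dry_run flag are not part of the original answer; they are only an illustration:

import os

# Hypothetical helper (not part of the original answer): delete files
# that have already been identified as duplicates.
def remove_duplicates(duplicate_paths, dry_run=True):
    """Delete the duplicate files in the given list.
    With dry_run=True, only print what would be removed."""
    for path in duplicate_paths:
        if dry_run:
            print("Would remove: %s" % path)
        else:
            os.remove(path)
            print("Removed: %s" % path)

To use it, check_for_duplicates would need to collect full_path into a list whenever a duplicate is found and return that list; running the sketch with dry_run=True first makes it easy to confirm that only the unwanted copies are listed before anything is actually deleted.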



This concludes the article on finding duplicate files and removing them. We hope the recommended answer is helpful, and thank you for supporting IT屋!
