[perl-python] a program to delete duplicate files


Problem Description


here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates.
(including files in sub-directories) More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A. These
duplicates' full paths will be prefixed with the "rm " string. An empty
line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

Order does not matter. (i.e. which file will not get "rm " does not
matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared to
all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dirs. One is certain that there are no duplicates within either dir
itself. (imagine that del_dup.pl has already been run on each) Files in
dir1 have already been categorized into sub-directories by a human, so
when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later dirs get "rm "
first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure. (which of the two
files in dir2 gets "rm " does not matter) Note that although dir2 does not
compare files within itself, duplicates may still be found implicitly by
indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b
are never compared.
--------------------------

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is a part of the spec.

Feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html
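
[Editorial note] Since the spec's key constraint is minimizing the number of comparisons, one common strategy (a sketch of my own, not the solution posted in this thread) is to never compare files pairwise at all: group files by size first, hash only the files whose sizes collide, and keep the copy from the earliest directory given on the command line. A minimal Python 3 sketch along those lines, with hypothetical names and SHA-256 instead of md5:

#!/usr/bin/env python3
# del_dup_sketch.py -- editorial sketch, not the solution posted in the thread.
# Files with a unique size are never even read; no file is compared pairwise.
import hashlib
import os
import sys

def walk_files(root):
    """Yield full paths of all regular files under root, subdirectories included."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                yield path

def hash_file(path, blocksize=1 << 20):
    """Return the SHA-256 digest of a file, reading it in blocks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.digest()

def find_duplicates(dirs):
    """Map digest -> list of (dir_index, path), hashing only size collisions."""
    by_size = {}
    for i, d in enumerate(dirs):
        for path in walk_files(d):
            by_size.setdefault(os.path.getsize(path), []).append((i, path))
    dups = {}
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue                  # unique size: no possible duplicate
        by_hash = {}
        for i, path in candidates:
            by_hash.setdefault(hash_file(path), []).append((i, path))
        for digest, group in by_hash.items():
            if len(group) > 1:
                dups[digest] = group
    return dups

def main():
    dirs = sys.argv[1:] or ['.']
    for group in find_duplicates(dirs).values():
        if len(dirs) > 1 and len({i for i, _p in group}) < 2:
            continue                  # per spec: within-dir duplicates don't count when several dirs are given
        group.sort()                  # the copy in the earliest directory on the command line is kept
        keep, *redundant = [path for _i, path in group]
        print(keep)
        for path in redundant:
            print("rm " + path)
        print()

if __name__ == '__main__':
    main()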

Recommended Answer

On 9 Mar 2005 04:56:13 -0800, rumours say that "Xah Lee" <xa*@xahlee.org> might
have written:

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is a part of the spec.

http://groups-beta.google.com/group/...8e292ec9adb82d

The whole thread is about finding duplicate files.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...


On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:

here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.


For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.

#!/usr/bin/env python
# (C) 2003 Anansi Spaceworks
#---------------------------------------------------------------------------
# find_duplicates
"""
Utility to find duplicate files in a directory tree by
comparing their checksums.
"""
#---------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#---------------------------------------------------------------------------

import os, sys, md5, getopt

def file_walker(tbl, srcpath, files):
    """
    Visit a path and collect data (including checksum) for files in it.
    """
    for file in files:
        filepath = os.path.join(srcpath, file)
        if os.path.isfile(filepath):
            chksum = md5.new(open(os.path.join(srcpath, file)).read()).digest()
            if not tbl.has_key(chksum): tbl[chksum] = []
            tbl[chksum].append(filepath)

def find_duplicates(treeroot, tbl=None):
    """
    Find duplicate files in directory.
    """
    dup = {}
    if tbl is None: tbl = {}
    os.path.walk(treeroot, file_walker, tbl)
    for k, v in tbl.items():
        if len(v) > 1:
            dup[k] = v
    return dup

usage = """
USAGE: find_duplicates <options> [<path ...]

Find duplicate files (by matching md5 checksums) in a
collection of paths (defaults to the current directory).

Note that the order of the paths searched will be retained
in the resulting duplicate file lists. This can be used
with --exec and --index to automate handling.

Options:
    -h, -H, --help
        Print this help.

    -q, --quiet
        Don't print normal report.

    -x, --exec=<command string>
        Python-formatted command string to act on the indexed
        duplicate in each duplicate group found. E.g. try
        --exec="ls %s"

    -n, --index=<index into duplicates>
        Which in a series of duplicates to use. Begins with '1'.
        Default is '1' (i.e. the first file listed).

Example:
    You've copied many files from path ./A into path ./B. You want
    to delete all the ones you've processed already, but not
    delete anything else:

    % find_duplicates -q --exec="rm %s" --index=1 ./A ./B
"""

def main():
    action = None
    quiet = 0
    index = 1
    dup = {}

    opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
                               ['quiet', 'help', 'exec=', 'index='])

    for opt, val in opts:
        if opt in ('-h', '-H', '--help'):
            print usage
            sys.exit()
        elif opt in ('-x', '--exec'):
            action = str(val)
        elif opt in ('-n', '--index'):
            index = int(val)
        elif opt in ('-q', '--quiet'):
            quiet = 1

    if len(args) == 0:
        dup = find_duplicates('.')
    else:
        tbl = {}
        for arg in args:
            dup = find_duplicates(arg, tbl=tbl)

    for k, v in dup.items():
        if not quiet:
            print "Duplicates:"
            for f in v: print "\t%s" % f
        if action:
            os.system(action % v[index-1])

if __name__ == '__main__':
    main()

--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks http://www.anansispaceworks.com
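
[Editorial note] The script above is Python 2 code: the md5 module, os.path.walk, dict.has_key and the print statement are all gone in Python 3. A rough port of just the checksum-table core (my adaptation, not part of the original post; the getopt/CLI handling would need the same treatment) could look like this:

import hashlib
import os

def find_duplicates(treeroot, tbl=None):
    """Map md5 digest -> list of file paths under treeroot; keep only groups of 2 or more."""
    if tbl is None:
        tbl = {}
    # os.walk replaces os.path.walk; setdefault replaces the has_key check.
    for dirpath, _dirnames, filenames in os.walk(treeroot):
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            if os.path.isfile(filepath):
                with open(filepath, 'rb') as f:
                    chksum = hashlib.md5(f.read()).digest()
                tbl.setdefault(chksum, []).append(filepath)
    return {k: v for k, v in tbl.items() if len(v) > 1}

As in the original, passing the same tbl dict across several calls preserves the order in which the paths were searched, which is what makes the --index=1 trick for keeping the first tree's copy work.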


I wrote something similar, have a look at
http://www.homepages.lu/pu/fdups.html.

