How to find duplicate directories


Question

Let's create a testing directory tree:

#!/bin/bash

top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }

mkfile() { printf "%s\n" "$(basename "$1")" > "$1"; }   # each file just contains its own name

mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"

The structure, as listed by find testdir \( -type d -printf "%p/\n" , -type f -print \), is:

testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z

I need to find the duplicate directories, but I need to consider only files (e.g. I should ignore (sub)directories without files). So, from the above test tree, the wanted result is:

duplicate directories:
testdir/d1
testdir/d2/d1some

because in both (sub)trees there are only the two identical files a and b (and several directories without files).

Of course, I could run md5deep -Zr ., or walk the whole tree with a perl script (using File::Find + Digest::MD5, or Path::Tiny, or the like) and calculate the files' md5-digests, but this doesn't help with finding the duplicate directories... :(
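
For example, such a walk could be as small as this rough sketch (it only gives per-file digests for the ./testdir tree from above, so it is not yet the solution):

#!/usr/bin/env perl
# rough sketch: print "<md5>  <path>" for every plain file under ./testdir
use strict;
use warnings;
use File::Find;
use Digest::MD5;

find(sub {
    return unless -f $_;                    # plain files only
    open my $fh, '<:raw', $_ or return;     # skip unreadable files
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    print "$md5  $File::Find::name\n";
}, './testdir');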

Any idea how to do this? Honestly, I don't have any idea.

Edit

  • I don't need working code. (I'm able to code it myself.)
  • I "just" need some ideas on how to approach the solution of the problem. :)

Edit2

The rationale behind this - why I need it: I have approx. 2.5 TB of data copied from many external HDDs as a result of a wrong backup strategy. E.g. over the years, whole $HOME dirs were copied onto (many different) external HDDs. Many sub-directories have the same content, but they are in different paths. So now I'm trying to eliminate the same-content directories.

And I need to do this by directories, because there are directories which share some duplicate files, but not all of them. Let's say:

/some/path/project1/a
/some/path/project1/b

/some/path/project2/a
/some/path/project2/x

e.g. a is a duplicate file (not only by name, but by content too) - but it is needed for both projects. So I want to keep a in both directories, even though the two copies are duplicates. Therefore I'm looking for a "logic" for finding duplicate directories.

Answer

Some key points:

  • If I understand right (from your comment, where you said: "Also, when me saying identical files I mean identical by their content, not by their name"), you want to find duplicate directories, e.g. directories whose content is exactly the same as the content of some other directory, regardless of the file names.
  • For this you must calculate some checksum or digest for the files. Identical digest = identical file (with great probability). :) As you already said, md5deep -Zr -of /top/dir is a good starting point (a small sketch of reading its output follows this list).
  • I added the -of because, for such a job, you don't want to calculate the contents of symlink targets or other special files like fifos - just plain files.
  • Calculating the md5 for each file in a 2.5 TB tree will surely take a few hours, unless you have a very fast machine. md5deep runs a thread per cpu-core, so while it runs you can write your scripts.
  • Also, consider running md5deep as sudo, because it could be frustrating if, after a long run, you get error messages about unreadable files only because you forgot to change the file ownerships... (Just a note.) :) :)
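
As a hedged sketch of that starting point, md5deep's output (with the flags mentioned above) could be streamed straight into perl instead of a temporary file; /top/dir is a placeholder and the parsing assumes the usual "<md5>  <path>" output lines:

#!/usr/bin/env perl
# sketch: read md5deep's "<md5>  <path>" lines directly from a pipe
use strict;
use warnings;

open my $md5deep, '-|', 'md5deep', '-Zr', '-of', '/top/dir'
    or die "cannot run md5deep: $!";
while (my $line = <$md5deep>) {
    chomp $line;
    my ($md5, $file) = split /\s+/, $line, 2;   # digest, then the full path
    next unless defined $file;
    # ...collect the ($md5, $file) pairs here for the directory-digest step below
}
close $md5deep or warn "md5deep exited with status $?";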

Regarding the "how to do it":

    • For comparing "directories" you need to calculate some "directory-digest", for easy comparison and for finding duplicates.
    • The single most important thing is to realize the following key points:
      • You can exclude directories containing files with unique digests. If a file is unique, e.g. has no duplicates at all, then it is pointless to check its directory: a unique file in a directory means the directory is unique too. So the script should ignore every directory containing files with unique MD5 digests (taken from md5deep's output).
      • You don't need to calculate the "directory-digest" from the files themselves (as you tried in your follow-up question). It is enough to calculate the "directory-digest" from the already-calculated file md5s - you just must ensure that you sort them first!

      For example, if your directory /path/to/some contains only two files a and b and

      if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
      and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
      

      you can calculate the "directory-digest" from the above file-digests, e.g. using Digest::MD5 you could do:

      perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
      

      and you will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command, but without the sort:

      perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
      

      produces: 3a13f2408f269db87ef0110a90e168ae.

      Note that even though the above digests aren't the digests of your files, they will be unique for every directory with different files and identical for directories with identical files (because identical files have identical md5 file-digests). The sorting ensures that you always calculate the digest in the same order; e.g. if some other directory contains two files

      file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
      file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
      

      then using the above sort and md5 you will again get 3bc22fb7aaebe9c8c5d7de312b876bb8 - e.g. a directory containing the same files as above...

      So, in this way you can calculate a "directory-digest" for every directory you have, and you can be sure that if you get another directory-digest of 3bc22fb7aaebe9c8c5d7de312b876bb8, it means: this directory contains exactly the above two files a and b (even if their names are different).

      This method is fast, because you calculate the "directory-digests" only from small 32-byte strings, so you avoid excessive repeated file-digest calculations.
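
      As a tiny sketch of just this step (the helper name dir_digest is only for illustration, and it assumes the per-directory file-digests are already collected):

      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      # "directory-digest": md5 over the sorted file-digests, as in the one-liners above
      sub dir_digest {
          my @file_md5s = @_;
          return md5_hex(sort @file_md5s);
      }

      # both orderings print 3bc22fb7aaebe9c8c5d7de312b876bb8
      print dir_digest(qw(0cc175b9c0f1b6a831c399e269772661 92eb5ffee6ae2fec3ad71c777531578f)), "\n";
      print dir_digest(qw(92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661)), "\n";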

      The final part is easy now. Your final data should be in the form:

      3a13f2408f269db87ef0110a90e168ae /some/directory
      16ea2389b5e62bc66b873e27072b0d20 /another/directory
      3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
      

      so from this it is easy to see that /some/directory and /path/to/other/directory are identical, because they have identical "directory-digests".
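
      For example, if those lines sit in a file (say dir-digests.txt - the name is just an example), one way to group them could be a one-liner like this (it assumes the directory paths contain no whitespace, because of the -a autosplit):

      perl -lane 'push @{$d{$F[0]}}, $F[1]; END { @$_ > 1 and print join "\n", "duplicate directories:", @$_, "" for values %d }' dir-digests.txt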

      Hm... all of the above is only a few lines of perl script. It would probably have been faster to write the perl script here directly than this long textual answer - but you said you don't want code... :) :)
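
      Still, purely as a hedged sketch of how those few lines might look (assuming md5deep output in the "<md5>  <path>" form, and comparing each directory only by the files directly inside it - rolling the digests up for whole subtrees is left out):

      #!/usr/bin/env perl
      # usage (hypothetical file name): md5deep -Zr -of /top/dir | perl find-dup-dirs.pl
      # finds same-content directories from md5deep output; file names are ignored
      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);
      use File::Basename qw(dirname);

      my (%md5s_in_dir, %count_of_md5);
      while (<>) {                                   # lines like: "<md5>  /path/to/file"
          chomp;
          my ($md5, $file) = split /\s+/, $_, 2;
          next unless defined $file && length $file;
          push @{ $md5s_in_dir{ dirname($file) } }, $md5;
          $count_of_md5{$md5}++;
      }

      my %dirs_by_digest;
      DIR: for my $dir (keys %md5s_in_dir) {
          my @md5s = @{ $md5s_in_dir{$dir} };
          for my $md5 (@md5s) {
              # a directory holding a globally unique file cannot be a duplicate
              next DIR if $count_of_md5{$md5} == 1;
          }
          push @{ $dirs_by_digest{ md5_hex(sort @md5s) } }, $dir;
      }

      for my $digest (sort keys %dirs_by_digest) {
          my @dirs = @{ $dirs_by_digest{$digest} };
          next if @dirs < 2;                         # unique directory-digest, no duplicates
          print "duplicate directories:\n";
          print "$_\n" for sort @dirs;
          print "\n";
      }

      The output mirrors the "duplicate directories:" listing asked for in the question, one group per identical directory-digest.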
