Find and remove image duplicates from ±2 million images

Problem description

A news portal company has two servers (OS: CentOS 6):

Server #1 holds about 1 million images (.jpg, .png), and server #2 holds almost the same count, another million images. Some of them are identical duplicates, some are resized duplicates, some are blurred variants, some are not blurred, and some are completely unique images. The file names are mostly different as well.

The mission is to merge the media catalogues of the two servers into one. After the merge, duplicates must be removed (to free up storage).

I've made some tests with ImageMagick's compare -metric RMSE, but comparing every file against every file from both servers would take ages: 1 million × 1 million = 1 trillion operations.

Any suggestions?

Recommended answer

Use GNU Parallel to calculate, just once for each image:

  • a data-only checksum

  • a perceptual hash

Then discard all but one of the images with identical checksums, and review the ones with similar perceptual hashes.
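Assuming the merged images live under a single directory (the path and job count below are illustrative, not part of the original answer), the checksum pass can be sketched with GNU Parallel driving ImageMagick's identify, with plain sort/uniq grouping the byte-identical images afterwards:

```shell
# Sketch only: /srv/images and -j8 are assumptions.
# Requires GNU Parallel and ImageMagick.

# 1. Emit one line per image: "<64-hex-char data signature> <path>"
find /srv/images -type f \( -iname '*.jpg' -o -iname '*.png' \) \
  | parallel -j8 'printf "%s %s\n" "$(identify -quiet -format "%#" {})" {}' \
  > checksums.txt

# 2. Byte-identical image data shares a signature: keep only the lines
#    whose first 64 characters repeat, each group separated by a blank line.
sort checksums.txt | uniq -w 64 --all-repeated=separate > duplicates.txt
```

From each group in duplicates.txt, all files but one can then be deleted; the resized and blurred variants will not appear here and are left for the perceptual-hash pass.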

Get a checksum over the image data only (i.e. not including any metadata, such as differing dates in your images) using ImageMagick like this:

identify -format "%#" a.jpg
9e51c9cf53fddc7d318341cd7e6c6e34663e5c49f20ede16e29e460dfc63867

Links for perceptual hash generation:

  • link
  • pHash.org
  • Fred did an excellent write-up here
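Whichever generator is used, the comparison step is the same: two images are review candidates when their perceptual hashes differ in only a few bits. A minimal POSIX-shell helper for that Hamming distance (the 64-bit hex-hash format is an assumption; pHash-style hashes are commonly 64 bits wide):

```shell
# Count differing bits between two perceptual hashes given as hex strings.
# The 64-bit width is an assumption; adjust the loop bound for other sizes.
hamming() {
  x=$(( 0x$1 ^ 0x$2 ))   # XOR leaves a 1 bit wherever the hashes differ
  n=0 i=0
  while [ "$i" -lt 64 ]; do
    n=$(( n + ( (x >> i) & 1 ) ))   # add bit i of the XOR (popcount)
    i=$(( i + 1 ))
  done
  echo "$n"
}

# Hashes differing only in the lowest bit are 1 bit apart:
hamming d1d1d1d1d1d1d1d1 d1d1d1d1d1d1d1d0   # prints 1
```

A small threshold (e.g. at most a handful of differing bits out of 64) is a common starting point for flagging near-duplicates; only the pairs under the threshold then need an expensive check such as compare -metric RMSE, instead of all trillion pairs.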

