Find and remove image duplicates from a ±2 million image pool
Question
A news portal company has two servers (OS = CentOS 6):

Server #1 has about 1 million images (.jpg, .png) and server #2 holds almost the same count - another 1 million images. Some of them are identical duplicates, some are resized duplicates, some are blurred, some are not, and some are totally unique images. File names are mostly different as well.
The mission is to merge the two servers' media catalogues into one. After the merge, duplicates must be removed (to free up storage).
I've made some tests with ImageMagick compare -metric RMSE, but I think comparing each file against every file from both servers would take ages: 1 million × 1 million = 1 trillion operations.
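The scale problem above can be sanity-checked with a line of arithmetic: pairwise comparison grows as n × m, while hashing each image once grows as n + m.

```python
# Brute-force pairwise comparison vs. hash-once-and-group, for the sizes
# described above (two servers with ~1 million images each).
n = m = 1_000_000
pairwise_ops = n * m   # compare every image on server 1 with every image on server 2
hash_ops = n + m       # compute one checksum/hash per image, then group by value

assert pairwise_ops == 10**12   # 1 trillion comparisons
assert hash_ops == 2 * 10**6    # 2 million hash computations
```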
Any suggestions here?
Answer
Use GNU Parallel to calculate, just once for each image:

- a data-only checksum
- a perceptual hash
Then discard all but one of the images in each group with identical checksums, and review the ones with similar perceptual hashes.
Get a checksum over the image data only (i.e. not including any metadata, such as a differing date in your images) using ImageMagick like this:
identify -format "%#" a.jpg
9e51c9cf53fddc7d318341cd7e6c6e34663e5c49f20ede16e29e460dfc63867
Links on perceptual hash generation:
- link
- pHash.org
- Fred did an excellent write-up here
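For the "similar perceptual hashes" step, a much simpler cousin of the pHash algorithms linked above is the average hash (aHash): threshold each pixel of a tiny grayscale thumbnail against the mean, then compare hashes by Hamming distance. The sketch below works on a plain list of 64 grayscale values; decoding real JPEG/PNG pixels and resizing to 8×8 would need an image library such as Pillow, which is assumed and left out here.

```python
# Simplified average hash (aHash): one bit per pixel of an 8x8 grayscale
# thumbnail, set when that pixel is brighter than the image's mean.
def average_hash(pixels):
    """pixels: flat list of 64 grayscale values in 0..255."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits; a small distance means visually similar."""
    return bin(h1 ^ h2).count("1")
```

Resized or slightly blurred copies tend to land within a few bits of each other, so candidates for review are pairs whose Hamming distance falls under a small threshold (e.g. ≤ 5 of 64 bits).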