Similar images - how to compare them

Question

I have over 1.3 million images that I have to compare with each other, and a few hundred more are added every day.

My company takes an image and creates a version that our vendors can use.

The files are often very similar to each other. For example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.

As a result, we end up creating the same logo twice when we could simply copy the one already created, or at least suggest it to the artists as a possible starting point.

I have looked around for algorithms to create a fingerprint, or something else that will allow me to do a simple query when a new image is uploaded. Time is relatively not an issue: at 1 second per fingerprint, 1.3 million images would take about 15 days to process, and the savings would be large enough that we might even get 3 or 4 servers to do it.

I am fluent in PHP, but if the algorithm is in pseudocode or even C, I can read it and try to translate it (unless it uses C-specific libraries).

Currently I compute an MD5 of every image to catch the ones that are exactly the same. This question came up when I was thinking of resizing each image and running MD5 on the resized version, to catch images that have been saved in a different format or resized, but even then the recognition would not be good enough.

In case I didn't mention it, I will be happy with something that just suggests possible "similar" images.

Edit

Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some values per image that I can store and later compare against the image I am looking at, without having to re-scan the whole server.

I am reading some pages that mention histograms, or resizing the image to a very small size, stripping any tags, converting it to grayscale, and then hashing that file and using the hash for comparison. If I am successful I will post the code/answer here.

Recommended answer

Try using file_get_contents and hash_file: http://www.php.net/manual/en/function.hash-file.php

If the hashes match, then you know the files are exactly the same.
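
As a rough illustration of the exact-match check, here is a minimal Python sketch of the same idea as PHP's hash_file. The function names `file_digest` and `same_bytes` are made up for this example, and SHA-256 is an arbitrary default; the MD5 mentioned in the question can be selected by passing `algorithm="md5"` where the platform allows it.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks, so large images are never loaded whole."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bytes(path_a, path_b):
    """True only when both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)
```

Note that this only detects byte-identical files: the same logo saved once as JPG and once as GIF produces two unrelated digests.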

Edit:
If possible, I would store the image hashes and the image paths in a database table, which might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store each hash in a table. Then, when a new image is submitted, you can hash it and do a lookup on the database table. If the hash is already there, discard the image. You can use the hash as the table index, so once you find a match you don't need to check the rest.
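
The table-lookup idea above could look roughly like the following Python sketch. It uses SQLite purely for illustration; the table name `images`, its columns, and the helpers `open_index` and `register` are assumptions for this example, not something the answer specifies.

```python
import sqlite3

def open_index(db_path=":memory:"):
    """Open (or create) a table keyed by hash, so each lookup is an index probe."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " hash TEXT PRIMARY KEY,"
        " path TEXT NOT NULL)"
    )
    return conn

def register(conn, digest, path):
    """Return the path of an already stored image with this hash, else store it and return None."""
    row = conn.execute(
        "SELECT path FROM images WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]  # duplicate: caller can discard or link the new upload
    conn.execute("INSERT INTO images (hash, path) VALUES (?, ?)", (digest, path))
    conn.commit()
    return None
```

With the hash as primary key, checking a new upload costs one indexed query regardless of how many images are already stored.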

The other option is to not use a database, but then you would always have to do an O(n) lookup: hash the incoming image, then run an in-memory linear search against the hashes of all saved images.

Edit #2:
Please see the solution here: Image comparison - fast algorithm
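
For finding similar rather than identical images, the idea raised in the question (shrink to a very small size, convert to grayscale, fingerprint the result) can be sketched as a simple "average hash". This is one common technique and not necessarily what the linked solution describes. The sketch assumes the image has already been decoded into a 2D list of grayscale values (0 to 255) by some image library; `average_hash` and `hamming_distance` are names chosen for this example.

```python
def average_hash(pixels, size=8):
    """Build a size*size-bit fingerprint from a 2D grid of grayscale values.

    Assumes `pixels` has rows of equal length and is at least size x size.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to size x size by averaging each block of pixels.
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

The fingerprint is a single integer per image that can be stored next to the file path. Two images are candidates for "similar" when the Hamming distance between their fingerprints is small; the cutoff is something to tune on real data. Unlike the exact-hash lookup, comparing against all stored fingerprints this way is a linear scan unless an additional index is built.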
