Image file checksum as a unique content compare optimisation


Problem description

Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because of irrelevant content. I'm searching for an optimisation of an 'AUTO-COMPARE' algorithm that skips the photos marked as forbidden. Every upload needs to be compared against many forbidden ones.

Possible solutions:

1/ Store the forbidden files and compare the whole content - works well but is slow.

2/ Store image file checksums and compare the checksums - this is the idea for improving the speed.

3/ Any intelligent algorithm that is fast enough and can compare similarity between photos. But I don't have any ideas about these in PHP.
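Option 2 can be sketched in PHP like this. The `forbidden_hashes` table and function name are hypothetical, not from the original post; `sha1_file()` hashes the raw bytes, so this only catches byte-identical re-uploads:

```php
<?php
// Sketch of option 2: store one hash per forbidden file, look it up on upload.
// The `forbidden_hashes` table is an assumed schema with an indexed `hash` column.
function isForbidden(PDO $db, string $uploadPath): bool
{
    $hash = sha1_file($uploadPath); // hashes the raw file bytes
    $stmt = $db->prepare('SELECT 1 FROM forbidden_hashes WHERE hash = ? LIMIT 1');
    $stmt->execute([$hash]);
    return (bool) $stmt->fetchColumn();
}
```

With an index on `hash`, the lookup is effectively constant-time, but changing a single byte of the image defeats the check.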

What is the best solution?

Recommended answer

Don't calculate checksums, calculate hashes!

I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash value over the first 1024 bytes, then append the width, height and size of the image to get a string like "875234:640:480:13286", which I would use as the key for the image. As it turns out, I haven't seen any false duplicates with this algorithm, although there is still a chance of them. However, this scheme will let duplicates slip through when someone adds just one byte to the file or makes very small adjustments to the image.
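The key scheme described above could look like this in PHP (the answerer wrote theirs in Delphi; the function name and the choice of `crc32` for the head hash are illustrative, and GD-readable image files are assumed):

```php
<?php
// Builds a key like "875234:640:480:13286": a hash over the first 1024 bytes,
// then the image width, height and file size. Illustrative PHP port of the
// Delphi scheme described above; any hash function would do.
function imageKey(string $path): string
{
    $head = (string) file_get_contents($path, false, null, 0, 1024);
    [$width, $height] = getimagesize($path); // requires a readable image file
    return sprintf('%u:%d:%d:%d', crc32($head), $width, $height, filesize($path));
}
```

Stored as a unique key, a duplicate upload then shows up as a plain key collision.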

Another trick could be to reduce the size and number of colours of every image. If you resize every image to 128x128 pixels and reduce the number of colours to 16 (4 bits), you end up with a reasonably unique pattern of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, so you would need to compare the pattern of the new image with the pattern stored in your system. This pattern compare could be used if the first hash solution indicates that the new image is unique. It's something I still need to work out for my own tool, though. But it's basically a way of taking fingerprints of images and then comparing them.
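A GD-based sketch of this fingerprint idea (the GD extension is assumed; this stores one palette index per byte, so the pattern is 16384 bytes rather than the 4-bit-packed 8192 described above):

```php
<?php
// Fingerprint sketch: downscale to 128x128, quantise to 16 colours, then hash
// the palette-index pattern. GD extension assumed.
function fingerprint(string $path): string
{
    $src = imagecreatefromstring((string) file_get_contents($path));
    $thumb = imagecreatetruecolor(128, 128);
    imagecopyresampled($thumb, $src, 0, 0, 0, 0, 128, 128, imagesx($src), imagesy($src));
    imagetruecolortopalette($thumb, false, 16); // reduce to a 16-colour palette
    $pattern = '';
    for ($y = 0; $y < 128; $y++) {
        for ($x = 0; $x < 128; $x++) {
            $pattern .= chr(imagecolorat($thumb, $x, $y)); // palette index 0..15
        }
    }
    return sha1($pattern); // primary key; compare $pattern itself on a hit
}
```

As the answer suggests, on a hash hit you would still compare the stored pattern itself to rule out false positives.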

My first solution will find exact matches. My second solution would find similar images. (Btw, I wrote my hash method in Delphi, but technically any hash method would be good enough.)

