Using SOLR to calculate "similarity"/"bitcount" between two ulongs

Question

We have a database of images for which I have calculated the PHASH using Dr. Neal Krawetz's method as implemented by David Oftedal.

The part of the sample code that calculates the difference between these longs is here:

ulong hash1 = AverageHash(theImage);
ulong hash2 = AverageHash(theOtherImage);

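// bitCounts is assumed to be the 256-entry lookup table from the sample app:
// bitCounts[b] holds the number of set bits in the byte value b.
uint BitCount(ulong theNumber)
{
    uint count = 0;
    // Consume the value one byte at a time, adding each byte's bit count.
    for (; theNumber > 0; theNumber >>= 8) {
        count += bitCounts[(theNumber & 0xFF)];
    }
    return count;
}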

Console.WriteLine("Similarity: " + ((64 - BitCount(hash1 ^ hash2)) * 100.0) / 64.0 + "%");

The challenge is that I only know one of these hashes, and I want to query SOLR to find other hashes in order of similarity.

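A few notes:

  1. Using SOLR here (the only alternative I have is HBASE)
  2. Want to avoid installing any custom Java into SOLR (happy to install an existing plugin)
  3. Happy to do lots of pre-processing in C#
  4. Happy to use multiple fields to store the data as a bit string, a long, etc.
  5. Using SOLRNet as the client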

Edit: some extra information (apologies, I got caught up in the problem and started assuming it was a widely known area). Here is a direct download of the C# console / sample app: http://01101001.net/Imghash.zip

An example output of this console app would be:

004143737f7f7f7f phash-test-001.jpg
0041417f7f7f7f7f phash-test-002.jpg
Similarity: 95.3125%
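
As a cross-check of the figure above, here is a minimal Python sketch of the same byte-wise bit count and similarity formula. The names `BIT_COUNTS`, `bit_count` and `similarity_percent` are illustrative and not taken from the sample app; only the two hash values come from the output shown.

```python
# Lookup table: BIT_COUNTS[b] is the number of set bits in the byte value b.
BIT_COUNTS = [bin(b).count("1") for b in range(256)]


def bit_count(number: int) -> int:
    """Count the set bits of a non-negative integer one byte at a time."""
    count = 0
    while number > 0:
        count += BIT_COUNTS[number & 0xFF]
        number >>= 8
    return count


def similarity_percent(hash1: int, hash2: int, bits: int = 64) -> float:
    """Percentage of bit positions in which two hashes agree."""
    differing = bit_count(hash1 ^ hash2)  # Hamming distance between the hashes
    return (bits - differing) * 100.0 / bits


if __name__ == "__main__":
    h1 = 0x004143737F7F7F7F  # phash-test-001.jpg
    h2 = 0x0041417F7F7F7F7F  # phash-test-002.jpg
    print(f"Similarity: {similarity_percent(h1, h2)}%")
```

The two hashes differ in 3 of 64 bits, which gives (64 - 3) * 100 / 64 = 95.3125.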

Recommended answer

You can use Solr's Fuzzy Search for this; in the documentation you have to scroll down a bit on the page to find it.

Solr's standard query parser supports fuzzy searches based on the Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term.

Assuming you have a schema like the one below, where the field phash holds the PHASH you have calculated:

<fields>
    <!-- ... all your other fields ... -->
    <field name="phash" type="string" indexed="true" stored="true" />
</fields>

You can then perform a query like

q=phash:004143737f7f7f7f~0.8&
fl=score,phash

This will return all documents whose PHASH has a Levenshtein (edit distance) similarity of at least 80% to the given hash. You will not get the 95.3125% given in your question but 87.5%, because whole characters rather than bits are counted as matching or not matching: 14 of the 16 hexadecimal digits are identical.
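
To see where the 87.5% comes from, here is a small Python sketch of a character-level edit similarity. It assumes the score is normalised as 1 minus the edit distance divided by the length of the longer string; that normalisation is an assumption made for illustration, and the function names are hypothetical rather than part of Solr.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(edit_similarity("004143737f7f7f7f", "0041417f7f7f7f7f"))
```

The two sample hashes need two substitutions, so the sketch yields 1 - 2/16 = 0.875, while the bit-level comparison sees only 3 differing bits out of 64.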

When you want to see that value, you can perform the following query

q=phash:004143737f7f7f7f~0.8&
fl=score,phash,strdist("0041417f7f7f7f7f", phash, edit)

This is a function call that fetches the string distance using the Levenshtein (edit) distance, and it will deliver a result similar to

+----------------+----------------------------------------+
|phash           |strdist("0041417f7f7f7f7f", phash, edit)|
+----------------+----------------------------------------+
|0041417f7f7f7f7f|1.0                                     |
+----------------+----------------------------------------+
|004143737f7f7f7f|0.875                                   |
+----------------+----------------------------------------+
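
Since the question asks for results in order of similarity, here is a hedged Python sketch that assembles the request parameters for such a lookup. The helper name `build_phash_query` is hypothetical, and the `sort` parameter (sorting by the `strdist` function) is an addition that does not appear in the queries above, so treat it as an assumption to verify against your Solr version. The sketch uses the same known hash in both `q` and `strdist`, and it omits the spaces inside the function call.

```python
from urllib.parse import parse_qs, urlencode


def build_phash_query(known_hash: str, min_similarity: float = 0.8,
                      field: str = "phash") -> str:
    """Return a URL-encoded Solr query string for a fuzzy PHASH lookup."""
    # Function query that scores each document's hash against the known one.
    distance = f'strdist("{known_hash}",{field},edit)'
    params = {
        "q": f"{field}:{known_hash}~{min_similarity}",  # fuzzy term query
        "fl": f"score,{field},{distance}",              # fields to return
        "sort": f"{distance} desc",                     # assumption: most similar first
    }
    return urlencode(params)


if __name__ == "__main__":
    query_string = build_phash_query("004143737f7f7f7f")
    print(query_string)
    # Decode again to show the parameters in readable form.
    for name, values in parse_qs(query_string).items():
        print(name, "=", values[0])
```

The resulting string would be appended to the select URL of your own core; host, port and core name are deliberately left out because they are not given here.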

When you want to reduce the gap between 95.3125% and 87.5%, you should consider storing the PHASH not as a hexadecimal value but as octal, for instance.
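
The effect of the encoding can be illustrated with a short Python sketch that renders the two sample hashes in several bases and measures how many character positions agree. Position-wise agreement is used here as a simple stand-in for the edit-distance score (for equal-length strings the edit distance can never exceed the number of differing positions). The helper names and the extra binary encoding are illustrative assumptions, not part of the answer.

```python
def encode(value: int, encoding: str) -> str:
    """Fixed-width, zero-padded rendering of a 64-bit value."""
    formats = {"hex": "016x", "octal": "022o", "binary": "064b"}
    return format(value, formats[encoding])


def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


if __name__ == "__main__":
    h1 = 0x004143737F7F7F7F
    h2 = 0x0041417F7F7F7F7F
    for encoding in ("hex", "octal", "binary"):
        s1, s2 = encode(h1, encoding), encode(h2, encoding)
        print(f"{encoding:>6}: {positional_similarity(s1, s2):.4%}")
```

For the sample pair, 14 of 16 hexadecimal digits match (87.5%), 20 of 22 octal digits match (about 90.9%), and 61 of 64 binary digits match (95.3125%). The fewer bits each character carries, the closer the character-level score gets to the bit-level one, at the cost of longer strings.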
