String similarity score/hash


Question

Is there a method to calculate something like a general "similarity score" of a string? That is, I am not comparing two strings directly; instead I get some number (a hash) for each string, and those numbers can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that "Hello world!" and "Hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
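The lookup described above can be sketched in a few lines of Python. This is only an illustration: the scores are copied from the example table in the question (they are made up, not computed by any real hash), and the function name `most_similar` and its parameter `k` are hypothetical.

```python
# Scores copied from the example table in the question.
# They are illustrative values, not the output of a real hash function.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score.

    Each result is a (string, absolute score difference) pair,
    sorted from the smallest difference to the largest.
    """
    target = scores[query]
    # Subtract the query's score from every other score and keep the absolute value.
    diffs = [(s, abs(v - target)) for s, v in scores.items() if s != query]
    diffs.sort(key=lambda pair: pair[1])
    return diffs[:k]


print(most_similar("Hello world", scores))
# → [('Hello world!', 10), ('Hello earth', 125), ('Foo world!', 1350)]
```

The hard part is not this lookup but producing scores that behave this way in the first place.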

Recommended answer

I believe what you're looking for is called a Locality Sensitive Hash (LSH). Whereas most hash algorithms are designed so that small variations in the input cause large changes in the output, these hashes attempt the opposite: small changes in the input produce proportionally small changes in the output.
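As a rough illustration of the idea, here is a minimal sketch of SimHash, one well-known locality-sensitive scheme. The answer does not name a specific algorithm, so the choice of SimHash, the character trigram features, the use of MD5 as the per-feature hash, and the names `shingles`, `simhash`, `hamming`, and `BITS` are all assumptions made for this example. Note also that SimHash fingerprints are compared by Hamming distance (the number of differing bits), not by subtracting one number from another as the question proposes.

```python
import hashlib

BITS = 64  # fingerprint width; assumed here, taken from the first 8 bytes of MD5


def shingles(text, n=3):
    """Split text into overlapping, lowercased character n-grams."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text):
    """Compute a 64-bit SimHash fingerprint of the text's character trigrams."""
    counts = [0] * BITS
    for gram in shingles(text):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams, such as "Hello world" and "Hello world!", would be expected to end up with a smaller Hamming distance than unrelated strings, although this is a statistical tendency rather than a guarantee.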

As others have mentioned, there are inherent issues with forcing a multi-dimensional relationship into a single number. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
