String similarity score/hash
Question
Is there a method to calculate something like a general "similarity score" for a string? That is, rather than comparing two strings with each other directly, I would get some number (a hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world 1000
Hello world! 1010
Hello earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
Foo world! 2350
You can see that "Hello world!" and "Hello world" are similar, and their scores are close to each other.
This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
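As a rough illustration of that lookup step only, here is a minimal Python sketch. It assumes the scores already exist and simply copies them from the example table above; the function name `most_similar` and the parameter `k` are made up for this illustration, and nothing here says how such scores would actually be computed.

```python
# Scores copied from the example table in the question (hypothetical values).
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by the absolute difference between each score and the query's score.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(most_similar("Hello world", scores))
# → ['Hello world!', 'Hello earth', 'Foo world!']
```

With precomputed scores the lookup is just a sort (or a binary search over a sorted list); the hard part is producing scores that behave this way in the first place.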
Recommended answer
I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed so that small variations in the input cause large changes in the output, these hashes attempt the opposite: small changes in the input produce proportionally small changes in the output.
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
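To make the idea concrete, here is a minimal sketch of one well-known locality-sensitive scheme, SimHash. The answer does not name a specific algorithm, so this choice is an assumption, as are the character-trigram features, the 64-bit fingerprint size, and the helper names `shingles`, `simhash` and `hamming`. Note also that SimHash fingerprints are compared by Hamming distance (the number of differing bits), not by numeric subtraction as the question imagines.

```python
import hashlib

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def shingles(text, n=3):
    """Split text into lowercase character n-grams (the assumed 'feature')."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text):
    """Compute a 64-bit SimHash fingerprint of the text."""
    counts = [0] * BITS
    for sh in shingles(text):
        # Take the first 8 bytes of MD5 as a 64-bit hash of the shingle.
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams tend to end up with fingerprints that differ in only a few bits, while unrelated strings tend to differ in roughly half of them. Which feature to hash (characters, words, tokens) is exactly the "optimized for whatever feature" decision the answer describes.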