String similarity score/hash


Question

Is there a method to calculate something like a general "similarity score" of a string? I do not want to compare two strings directly. Instead, I want to get a number (a hash) for each string that can later tell me whether two strings are similar. Two similar strings should have similar (close) hashes.

Consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that "Hello world!" and "Hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string would mean subtracting the given string's score from each of the other scores and then sorting by the absolute value of the differences.
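
The lookup described above can be sketched as follows. This is only an illustration of the intended query: the scores are the hand-picked example values from the table, not the output of any real hash, and the name nearest_by_score is hypothetical.

```python
# Example scores copied from the table in the question.
# They are hand-picked illustrations, not computed by any algorithm.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def nearest_by_score(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by the absolute difference between each score and the target.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(nearest_by_score("Hello world", scores, k=2))
```

With precomputed scores kept in a sorted list, the same query could likely be answered with a binary search instead of a full sort.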

Recommended answer

I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed so that small variations in the input cause large changes in the output, these hashes attempt the opposite: small changes in the input produce proportionally small changes in the output.
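
As one concrete illustration, here is a minimal sketch of SimHash, a well-known locality-sensitive hash for text. The answer does not name a specific algorithm, so this choice is an assumption, as are the character trigrams, the use of MD5 as the per-feature hash, and the function names shingles, simhash and hamming. Note that SimHash fingerprints are compared by Hamming distance (the number of differing bits) rather than by numeric subtraction, so it does not produce the single sortable score the question asks for.

```python
import hashlib

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def shingles(text, n=3):
    """Split text into overlapping, lowercased character n-grams."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text):
    """Compute a 64-bit SimHash fingerprint of text."""
    counts = [0] * BITS
    for sh in shingles(text):
        # Hash each feature to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams tend to agree on most fingerprint bits, so a small Hamming distance suggests similar strings.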

As others have mentioned, there are inherent problems with forcing a multi-dimensional mapping into a lower-dimensional one (here, a single number per string). It is analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you are using to decide whether strings are "alike".
