Hash function to index similar text

Views: 145 Published: 2018/6/1 18:50:24 hash similarity


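Problem description

I am looking for a hash function to index similar texts. For example, given two very long texts, A and B, that differ only slightly, the hash function (call it H) applied to A and B should return the same number.

That is, H(A) = H(B) whenever A and B are similar texts.

I tried DoubleMetaphone (my text is in Italian), but I found that it depends very strongly on the string prefix. For example:

A = "This is the very long text that I want to hash"
B = "This is the very"

==> doubleMetaPhone(A) = doubleMetaPhone(B)

This does not work for me, because strings that merely share a prefix can be treated as similar, and I do not want that.

Can anyone suggest another approach?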

Solution

See http://en.wikipedia.org/wiki/Locality_sensitive_hashing
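The answer only links to the article, so here is a minimal sketch of one locality-sensitive hashing scheme, MinHash with banding, using only the Python standard library. The choice of MinHash, the word-shingle size, the number of hash functions and bands, and every name below (`shingles`, `seeded_hash`, `minhash_signature`, `bucket_keys`, `SimilarTextIndex`) are assumptions made for illustration, not something the answer specifies. Note also that LSH does not give a single H with H(A) = H(B) guaranteed: it produces several bucket keys per text, and similar texts share at least one key with high probability.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, token):
    """Hypothetical helper: one 64-bit hash function per seed, built on BLAKE2b."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text, k)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def bucket_keys(signature, bands=16):
    """Split the signature into bands; each band becomes one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


class SimilarTextIndex:
    """Toy in-memory index: texts sharing any bucket key become candidates."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, doc_id, text):
        for key in bucket_keys(minhash_signature(text)):
            self.buckets[key].add(doc_id)

    def candidates(self, text):
        found = set()
        for key in bucket_keys(minhash_signature(text)):
            found |= self.buckets.get(key, set())
        return found
```

With this sketch, the two strings from the question share only 2 of their 9 distinct three-word shingles, so a common prefix alone is unlikely to put them in the same bucket, while two long texts that differ in a word or two will very probably collide in at least one band. Candidates returned by `candidates` should still be confirmed with `estimated_similarity` or an exact comparison, since banding can produce false positives.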

