Hash function to index similar text
Question
I'm looking for a kind of hash function to index similar text. For example, if we have two very long texts, "A" and "B", that differ only slightly, then the hash function (call it H) applied to A and B should return the same number.
So H(A) = H(B) whenever A and B are similar texts.
I tried "DoubleMetaphone" (my text is in Italian), but I found that it depends very strongly on string prefixes. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
This is not good for me, because strings that merely share a prefix would be treated as similar, and I don't want that.
Could anyone suggest another approach?
See http://en.wikipedia.org/wiki/Locality_sensitive_hashing
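To make the pointer above a little more concrete, here is a minimal sketch of MinHash, one of the schemes in the locality-sensitive hashing family, written in Python with only the standard library. It is an illustration under assumptions, not a definitive implementation: the function names (`shingles`, `minhash_signature`, `estimated_similarity`, `lsh_keys`) and the parameters (5-character shingles, 64 hash functions, 16 bands) are hypothetical choices made for this example. Note that no scheme of this kind guarantees H(A) = H(B) for every pair of similar texts; instead, similar texts agree on many signature positions and are therefore likely to share at least one bucket key.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-character shingles of the text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    """Stable 64-bit hash of a shingle; each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=5):
    """For each seed, keep the minimum hash value over all shingles of the text."""
    sh = shingles(text, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_keys(signature, bands=16):
    """Split the signature into bands; each (band index, band values) pair is one bucket key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


if __name__ == "__main__":
    a = "This is the very long text that I want to hash"
    b = "This is the very long text that I wish to hash"
    c = "This is the very"
    sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
    print("A vs B:", estimated_similarity(sig_a, sig_b))
    print("A vs C:", estimated_similarity(sig_a, sig_c))
    print("A and B share a bucket:", bool(lsh_keys(sig_a) & lsh_keys(sig_b)))
```

To build an index, each text would be stored under all of its `lsh_keys`, and the candidates for a query are the texts that share at least one key with it. Because the signature is computed from shingles taken across the whole text, a short prefix such as B in the question shares only a small fraction of A's shingles and should score low, unlike with a prefix-sensitive phonetic code.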