哈希函数可以在相同的桶中散列相似的字符串 [英] Hash function that hashes similar strings in the same bucket

查看:145
本文介绍了哈希函数可以在相同的桶中散列相似的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一个坏的散列函数:
我想散列字符串并将相似的字符串放入一个存储桶中。



可以你给我一个暗示从哪里开始我的研究?
一些方法或算法名称...

解决方案

你的问题并不容易。两个想法:

这个解决方案可能过于复杂,但您可以尝试傅里叶变换。将输入文本视为一系列函数样本,然后运行傅里叶变换将输入转换为频域。低频部分是文本的一般jist,高频部分是微小的变化。



这与jpeg压缩有些类似:丢弃细节只留下重要的东西。如果你有两个几乎完全相同的图像,并且你用jpeg压缩它们,你通常会获得相同的输出。



pHash使用类似于此的方法。 b
$ b

再次,这将是一个非常复杂的方式。

第二个想法:minHash



minHash的想法是,您可以在输入相同时选择一些可能相同的标记。然后计算所有标记输出的矢量。如果两个输入具有相似的向量,则输入类似。例如,计算单词the出现在文本中的次数。如果它是偶数,则为0,如果它很奇怪,1.现在计算文本中出现数学一词的次数。同样,0表示偶数,1表示奇数。这样做的话很多。



现在你处理所有的文本,每个文本给你一个输出,如011100010101或其他。如果两个文本相似,那么它们将具有相似的输出字符串,仅相差1或2位。您可以使用多变量分区线索(MVP)来有效地搜索输出。



这也可能会对您的问题产生影响。


I'm searching for a "bad" hash function: I'd like to hash strings and put similar strings in one bucket.

Can you give me a hint where to start my research? Some methods or algorithm names...

解决方案

Your problem is not an easy one. Two ideas:

This solution might be overly complicated but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low frequency part is the general jist of the text and the high frequency part is the tiny changes.

This is somewhat similar to what jpeg compression does: Throw away the details and just leave the important stuff. If you have two almost-identical images and you jpeg compress them greatly, you usually get the same output.

pHash uses a method similar to this.

Again, this is going to be a pretty complicated way to do it.

Second idea: minHash

The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.

For example, count how many times the word "the" appears in the text. If it's even, 0, if it's odd, 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.

Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar then they will have similar outputs strings, differing by just 1 or two bits. You can use a multi-variate partition trie (MVP) to search the outputs efficiently.

This, too, might be overkill for your problem.

这篇关于哈希函数可以在相同的桶中散列相似的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆