为LSH Minhash算法生成随机哈希函数 [英] Generating Random Hash Functions for LSH Minhash Algorithm

查看:339
本文介绍了为LSH Minhash算法生成随机哈希函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在用Java编写一个minhashing算法,它要求我生成任意数量的随机散列函数(在我的情况下为240个散列函数),并通过它运行任意数量的整数(目前为2000)。

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment).

为了做到这一点,我已经为240个散列函数中的每一个生成随机数a,b和c(从1到2001的范围)。然后,我的哈希函数返回h =((a * x)+ b)%c,其中h是返回值,x是通过它运行的整数之一。

In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.

这是随机散列的有效实现,还是有更常见/可接受的方式来实现?

Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?

这篇文章提出了类似的问题,但我仍然有点困惑的答案措辞: Minhash实施如何找到排列的哈希函数

This post was asking a similar question, but I'm still somewhat confused by the wording of the answer: Minhash implementation how to find hash functions for permutations

推荐答案

几年前,当我使用Bloom过滤器时,我碰到了一篇文章,描述了如何使用最少的代码非常简单地生成多个哈希函数。他描述的方法非常有效。请参阅减少散列,同样的性能:构建更好的绽放过滤器

When I was working with Bloom filters a few years ago, I ran across an article that describes how to generate multiple hash functions very simply, with a minimum of code. The method he describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter.

基本思路是创建两个哈希函数,称之为 h1 h2 ,使用该公式,您可以使用公式模拟多个哈希函数 g1 gk

The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions, g1 through gk, using the formula:

gi = h1(x) + i*h2(x)

i 从1到 k (从你想要的散列函数的数量。)

i varies from 1 to k (the number of hash functions you want).

即使你决定不实现他的想法,这篇论文也值得一读。虽然在阅读之后我无法想象不想实现它。它使我的Bloom过滤器代码更易于处理,并且不会对性能产生负面影响。

The paper is well worth reading, even if you decide not to implement his idea. Although after reading it I can't imagine not wanting to implement it. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.

这篇关于为LSH Minhash算法生成随机哈希函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆