What hashing function does Spark use for HashingTF and how do I duplicate it?


Question

Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.

1) What function does it use to do the hashing?

2) How can I achieve the same hashed value from Python?

3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?

Answer

If you're in doubt it is usually good to check the source. The bucket for a given term is determined as follows:

def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures

As you can see it is just a plain old hash modulo the number of buckets.
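To get the same bucket index outside of Spark you can reproduce that computation directly. Here is a minimal sketch in plain Python, assuming the default numFeatures of 1 << 20 (the name bucket_for is just illustrative):

def bucket_for(term, num_features=1 << 20):
    # Same computation as HashingTF.indexOf: built-in hash modulo bucket count.
    # Note: Python 3 randomizes string hashes per process, so fix PYTHONHASHSEED
    # if you need values that are stable across runs and match the Spark workers.
    return hash(term) % num_features

print(bucket_for("spark"))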

The final hash is just a vector of counts per bucket (I've omitted the docstring and RDD case for brevity):

def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
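If you want to mimic this locally without Spark, the same logic can be written with an ordinary dict instead of a SparseVector. A hypothetical standalone equivalent (term_frequencies is my own name, not part of any library):

def term_frequencies(document, num_features=1 << 20):
    freq = {}
    for term in document:
        i = hash(term) % num_features      # same bucket as indexOf
        freq[i] = freq.get(i, 0) + 1.0     # count occurrences per bucket
    return freq

print(term_frequencies(["a", "b", "a"]))   # "a" is counted twice, "b" once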

If you want to ignore frequencies then you can use set(document) as an input, but I doubt there is much to gain here. To create a set you'll have to compute the hash for each element anyway.
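For illustration, a small sketch of the set(document) variant, assuming a working PySpark installation (tf and doc are placeholder names, and numFeatures is an arbitrary choice):

from pyspark.mllib.feature import HashingTF

tf = HashingTF(numFeatures=1 << 10)
doc = ["a", "b", "a"]
counts_vector = tf.transform(doc)        # the "a" bucket gets 2.0
binary_vector = tf.transform(set(doc))   # every non-zero entry is 1.0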
