What hashing function does Spark use for HashingTF and how do I duplicate it?
Question
Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.
1) What function does it use to do the hashing?

2) How can I achieve the same hashed value from Python?

3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?
Answer
If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:
def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures
As you can see, it is just a plain old hash modulo the number of buckets.
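That answers questions 2 and 3: the bucket for a single term can be computed in plain Python without Spark at all. A minimal sketch, assuming HashingTF's default of 2^20 features (`index_of` and `num_features` are illustrative names, not pyspark API). One caveat: since Python 3.3, string hashes are randomized per process unless `PYTHONHASHSEED` is fixed, so the values only match within one interpreter session (or between the driver and workers sharing the same seed):

```python
# Sketch: replicate HashingTF's bucket assignment for one term,
# assuming the same built-in hash() used in the pyspark source above.
NUM_FEATURES = 1 << 20  # HashingTF's default number of buckets (2^20)

def index_of(term, num_features=NUM_FEATURES):
    # Same rule as HashingTF.indexOf: plain hash modulo the bucket count.
    return hash(term) % num_features

bucket = index_of("spark")
```

Within a single process the result is deterministic, so repeated calls with the same term always land in the same bucket.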
The final hash is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):
def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
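The same counting logic can be mirrored without pyspark by returning a plain dict in place of Vectors.sparse — a sketch under the same built-in-`hash` assumption (`term_frequencies` is a hypothetical name, not pyspark API):

```python
# Sketch: HashingTF.transform without pyspark, assuming built-in hash().
# Returns a {bucket_index: count} dict as a stand-in for Vectors.sparse.
def term_frequencies(document, num_features=1 << 20):
    freq = {}
    for term in document:
        i = hash(term) % num_features  # same bucket rule as indexOf
        freq[i] = freq.get(i, 0) + 1.0
    return freq

tf = term_frequencies(["spark", "hashing", "spark"])
# "spark" appears twice, so its bucket holds the count 2.0.
```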
If you want to ignore frequencies then you can use set(document) as the input, but I doubt there is much to gain here. To create the set you'll have to compute the hash of each element anyway.
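That set-based variant can be sketched like this (`binary_frequencies` is a hypothetical name): every bucket with at least one term gets 1.0, regardless of how many times the term occurs:

```python
# Sketch: presence-only variant of the transform, assuming built-in hash().
# Deduplicating with set() turns counts into 0/1 indicators per bucket.
def binary_frequencies(document, num_features=1 << 20):
    return {hash(term) % num_features: 1.0 for term in set(document)}

tb = binary_frequencies(["spark", "hashing", "spark"])
# Every stored value is 1.0; duplicate terms add nothing.
```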