What hashing function does Spark use for HashingTF and how do I duplicate it?
Question
Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.
1) What function does it use to do the hashing?

2) How can I achieve the same hashed value from Python?

3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?
Answer
If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:
def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures
As you can see, it is just a plain old hash modulo the number of buckets.
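That answers questions 2 and 3: the bucket for a single term can be computed in plain Python without Spark at all. A minimal sketch, assuming HashingTF's default of 2^20 features (`index_of` and `num_features` are illustrative names, not pyspark API). One caveat: since Python 3.3, string hashes are randomized per process unless `PYTHONHASHSEED` is fixed, so the values only match within one interpreter session (or between the driver and workers sharing the same seed):

```python
# Sketch: replicate HashingTF's bucket assignment for one term,
# assuming the same built-in hash() used in the pyspark source above.
NUM_FEATURES = 1 << 20  # HashingTF's default number of buckets (2^20)

def index_of(term, num_features=NUM_FEATURES):
    # Same rule as HashingTF.indexOf: plain hash modulo the bucket count.
    return hash(term) % num_features

bucket = index_of("spark")
```

Within a single process the result is deterministic, so repeated calls with the same term always land in the same bucket.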
The final hash is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):
def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
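The same counting logic can be mirrored without pyspark by returning a plain dict in place of Vectors.sparse — a sketch under the same built-in-`hash` assumption (`term_frequencies` is a hypothetical name, not pyspark API):

```python
# Sketch: HashingTF.transform without pyspark, assuming built-in hash().
# Returns a {bucket_index: count} dict as a stand-in for Vectors.sparse.
def term_frequencies(document, num_features=1 << 20):
    freq = {}
    for term in document:
        i = hash(term) % num_features  # same bucket rule as indexOf
        freq[i] = freq.get(i, 0) + 1.0
    return freq

tf = term_frequencies(["spark", "hashing", "spark"])
# "spark" appears twice, so its bucket holds the count 2.0.
```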
If you want to ignore frequencies then you can use set(document) as the input, but I doubt there is much to gain here. To create the set you'll have to compute the hash of each element anyway.
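That set-based variant can be sketched like this (`binary_frequencies` is a hypothetical name): every bucket with at least one term gets 1.0, regardless of how many times the term occurs:

```python
# Sketch: presence-only variant of the transform, assuming built-in hash().
# Deduplicating with set() turns counts into 0/1 indicators per bucket.
def binary_frequencies(document, num_features=1 << 20):
    return {hash(term) % num_features: 1.0 for term in set(document)}

tb = binary_frequencies(["spark", "hashing", "spark"])
# Every stored value is 1.0; duplicate terms add nothing.
```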