How to get word details from TF Vector RDD in Spark ML Lib?

Posted: 2020/9/3 23:34:54. Tags: apache-spark, apache-spark-mllib, tf-idf, apache-spark-ml

Question

I have created term frequencies using HashingTF in Spark, and obtained the term frequency of each word using tf.transform.

However, the results are shown in this format:

[<hashIndexofHashBucketofWord1>,<hashIndexofHashBucketofWord2> ...]
,[termFrequencyofWord1, termFrequencyOfWord2 ....]

For example:

(1048576,[105,3116],[1.0,2.0])

I am able to get a word's index in the hash bucket using tf.indexOf("word").

But how can I get the word back from the index?

Recommended answer

Well, you can't. Since hashing is non-injective, there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is impossible to tell which one is actually there.

If you're using a large hash and the number of unique tokens is relatively low, you can try to create a lookup table from each bucket to the possible tokens in your dataset. It is a one-to-many mapping, but if the above conditions are met, the number of conflicts should be relatively low.
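A minimal sketch of such a lookup table, in plain Python so it runs without a cluster. The `bucket_of` function is a hypothetical stand-in for the real hash and does not reproduce Spark's hashing; with an actual HashingTF you would pass `tf.indexOf` as `index_of` instead. The helper names and the sample documents are assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 1 << 20  # 1048576, the vector size shown in the question


def bucket_of(token, num_features=NUM_FEATURES):
    # Stand-in hash, NOT Spark's: with a real HashingTF, use tf.indexOf(token).
    return zlib.crc32(token.encode("utf-8")) % num_features


def build_lookup(documents, index_of=bucket_of):
    """Map each bucket index to the set of dataset tokens that hash into it."""
    lookup = defaultdict(set)
    for tokens in documents:
        for token in tokens:
            lookup[index_of(token)].add(token)
    return dict(lookup)


documents = [["foo", "bar"], ["foo", "foobar", "baz"]]
lookup = build_lookup(documents)

# All tokens from the dataset that could have produced this bucket index.
candidates = lookup[bucket_of("foo")]
```

Each value is a set because the mapping is one-to-many: a bucket holding more than one token marks a collision that cannot be resolved from the index alone.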

If you need a reversible transformation, you can combine Tokenizer and StringIndexer and build a sparse feature vector manually.
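The idea behind the manual approach can be sketched in plain Python: give every distinct token its own index, then count occurrences per index. This only imitates what an indexer does and is not the Spark API; the function names, the frequency-then-alphabetical ordering, and the sample data are assumptions of this sketch.

```python
from collections import Counter


def fit_index(documents):
    """Assign each distinct token a unique index, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(token for tokens in documents for token in tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: i for i, token in enumerate(ordered)}


def to_sparse(tokens, index):
    """Return (size, indices, values), the sparse layout shown in the question."""
    counts = Counter(index[t] for t in tokens)
    indices = sorted(counts)
    return len(index), indices, [float(counts[i]) for i in indices]


documents = [["foo", "bar"], ["foo", "foobar", "baz"]]
index = fit_index(documents)
inverse = {i: token for token, i in index.items()}  # one-to-one, hence invertible

size, indices, values = to_sparse(["foo", "bar", "foo"], index)
words = [inverse[i] for i in indices]
```

Because every token owns exactly one index, the `inverse` dictionary recovers the word from any index, which is what hashing cannot offer.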

See also: What hashing function does Spark use for HashingTF and how do I duplicate it?

Edit:

In Spark 1.5+ (PySpark 1.6+) you can use CountVectorizer, which applies a reversible transformation and stores the vocabulary.

Python:

from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
]).toDF(["id", "tokens"])

vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
vectorizer.vocabulary
## ('foo', 'baz', 'bar', 'foobar')

Scala:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = sc.parallelize(Seq(
    (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
)).toDF("id", "tokens")

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(df)

model.vocabulary
// Array[String] = Array(foo, baz, bar, foobar)

Here the element at the 0th position corresponds to index 0, the element at the 1st position to index 1, and so on.
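To make the index-to-word mapping concrete, here is a small plain-Python sketch that decodes a sparse vector using the vocabulary printed in the example. The `decode` helper and the sample vector are hypothetical; with a real PySpark `SparseVector` you would pass its `indices` and `values` attributes.

```python
# Vocabulary as printed by the CountVectorizer example.
vocabulary = ["foo", "baz", "bar", "foobar"]


def decode(indices, values, vocabulary):
    """Pair each active index of a sparse vector with its word and count."""
    return {vocabulary[i]: v for i, v in zip(indices, values)}


# Hypothetical sparse vector (4, [0, 1, 3], [1.0, 1.0, 1.0])
# for the tokens ["foo", "foobar", "baz"] under this vocabulary.
decoded = decode([0, 1, 3], [1.0, 1.0, 1.0], vocabulary)
```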
