How do I preserve the key or index of the input to Spark's HashingTF() function?

Views: 23 | Published: 2021/11/14 21:09:04 | Tags: apache-spark, apache-spark-mllib, tf-idf

Problem description

Based on the Spark documentation for 1.4 (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html), I'm writing a TF-IDF example that converts text documents to vectors of values. The example given shows how this can be done, but the input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this:

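from pyspark.mllib.feature import HashingTF

# sc is an existing SparkContext; each document becomes a list of tokens
documents = sc.textFile("...").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)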

I would like to do something like this:

documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.
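
To make the desired shape concrete, here is a minimal plain-Python sketch of the hashing trick applied per (key, tokens) pair, so each term-frequency map stays tied to its document. It does not use Spark. The names `hashing_tf`, `keyed_tf`, and `NUM_FEATURES` are hypothetical, the line index stands in for UNIQUE_LINE_KEY, and `zlib.crc32` is an assumed stand-in hash, not the one Spark uses.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 20  # assumed bucket count, chosen only for illustration


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map a list of tokens to {bucket_index: term_count} via the hashing trick."""
    # zlib.crc32 is a stand-in so results are stable across runs;
    # it is NOT the hash function Spark's HashingTF uses.
    buckets = (zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return dict(Counter(buckets))


def keyed_tf(lines):
    """Yield (line_index, term_frequencies), keeping the key next to its vector."""
    for key, line in enumerate(lines):
        yield key, hashing_tf(line.split(" "))


lines = ["spark makes tf idf easy", "spark spark hashing"]
tf_by_key = dict(keyed_tf(lines))
```

On an RDD of (key, tokens) pairs, the analogous step would presumably be either mapping over the values only or zipping `documents.keys()` with the transformed `documents.values()`. That is a suggestion, not something tested here.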
