how do I preserve the key or index of input to Spark HashingTF() function?
Question
Based on the Spark documentation for 1.4 (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html) I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done but the input is a RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this:
documents = sc.textFile("...").map(lambda line: line.split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
I would like to do something like this:
documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.
Answer
If you use a version of Spark from after commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f (this includes 1.4) and use the ml API's HashingTF (which takes a DataFrame as input instead of a plain RDD), the original columns are kept in its output. Hope that helps!
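If you want to stay with the mllib RDD API instead, HashingTF.transform also accepts a single document (a list of terms), so you can keep your keys with something like `documents.mapValues(hashingTF.transform)`. The plain-Python sketch below illustrates the pattern without Spark: `toy_hashing_tf` is a stand-in for the real hasher (it uses Python's `hash`, not Spark's), and the point is only that the key rides along untouched:

```python
# Stand-in for mllib HashingTF.transform on a single document:
# each term is hashed into a bucket of a fixed-size count vector.
def toy_hashing_tf(terms, num_features=16):
    vec = [0] * num_features
    for term in terms:
        vec[hash(term) % num_features] += 1
    return vec

# Keyed documents, mirroring the (UNIQUE_LINE_KEY, tokens) pairs
# from the question.
keyed_docs = [
    ("doc-0", ["spark", "hashing", "tf"]),
    ("doc-1", ["spark", "spark", "rdd"]),
]

# mapValues-style transform: only the value is hashed, the key is kept.
keyed_tf = [(key, toy_hashing_tf(tokens)) for key, tokens in keyed_docs]

for key, vec in keyed_tf:
    print(key, sum(vec))
```

Each output vector's counts sum to the number of tokens in that document, and the key still identifies the original line.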