how do I preserve the key or index of input to Spark HashingTF() function?
Question
Based on the Spark documentation for 1.4 (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html) I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done but the input is a RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this:
documents = sc.textFile("...").map(lambda line: line.split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
I would like to do something like this:
documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.
Answer
If you use a version of Spark from after commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f (this includes 1.4) and use the ml API's HashingTF (which takes a DataFrame as input instead of a plain RDD), the original columns are kept in its output. Hope that helps!
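If you want to stay with the mllib RDD API instead, HashingTF.transform also accepts a single document (a list of terms), so you can keep your keys with something like `documents.mapValues(hashingTF.transform)`. The plain-Python sketch below illustrates the pattern without Spark: `toy_hashing_tf` is a stand-in for the real hasher (it uses Python's `hash`, not Spark's), and the point is only that the key rides along untouched:

```python
# Stand-in for mllib HashingTF.transform on a single document:
# each term is hashed into a bucket of a fixed-size count vector.
def toy_hashing_tf(terms, num_features=16):
    vec = [0] * num_features
    for term in terms:
        vec[hash(term) % num_features] += 1
    return vec

# Keyed documents, mirroring the (UNIQUE_LINE_KEY, tokens) pairs
# from the question.
keyed_docs = [
    ("doc-0", ["spark", "hashing", "tf"]),
    ("doc-1", ["spark", "spark", "rdd"]),
]

# mapValues-style transform: only the value is hashed, the key is kept.
keyed_tf = [(key, toy_hashing_tf(tokens)) for key, tokens in keyed_docs]

for key, vec in keyed_tf:
    print(key, sum(vec))
```

Each output vector's counts sum to the number of tokens in that document, and the key still identifies the original line.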