How do I preserve the key or index of the input to the Spark HashingTF() function?


Question

Based on the Spark 1.4 documentation (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html), I'm writing a TF-IDF example that converts text documents to vectors of values. The example given shows how this can be done, but its input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key that refers back to the original document. The example is this:

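from pyspark.mllib.feature import HashingTF

# sc is an existing SparkContext, e.g. the one provided by the pyspark shell.
# Each document is one line of text, split into a list of tokens.
documents = sc.textFile("...").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)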

I would like to do something like this:

documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples, it appears there is no good way to link the documents RDD with the tf RDD.

Recommended answer

If you use a version of Spark from after commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f (this includes 1.4) and use the ml API HashingTF (which requires DataFrame input instead of plain RDDs), the original columns are kept in its output. Hope that helps!
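A minimal sketch of that approach, under some assumptions: the column names doc_id, words, and tf and the helpers keyed_rows and term_frequencies are illustrative and not part of Spark, and a positional index stands in for UNIQUE_LINE_KEY. The session argument is assumed to be a SQLContext on Spark 1.4 or a SparkSession on Spark 2.0 and later.

```python
def keyed_rows(lines):
    """Pair each line with a positional id and split it into tokens."""
    return [(doc_id, line.split(" ")) for doc_id, line in enumerate(lines)]


def term_frequencies(session, lines, num_features=1 << 18):
    """Return a DataFrame with the columns doc_id, words, and tf.

    `session` is anything exposing createDataFrame: a SQLContext on
    Spark 1.4, or a SparkSession on Spark 2.0 and later (assumption).
    """
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark.ml.feature import HashingTF

    df = session.createDataFrame(keyed_rows(lines), ["doc_id", "words"])
    hashing_tf = HashingTF(
        numFeatures=num_features, inputCol="words", outputCol="tf"
    )
    # transform() appends the "tf" column and keeps the input columns,
    # so each vector stays on the same row as its doc_id.
    return hashing_tf.transform(df)
```

With a running Spark installation, something like term_frequencies(sqlContext, ["a b", "b c"]).select("doc_id", "tf").show() should list each document id next to its term-frequency vector.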
