Converting RDD to Contingency Table: Pyspark

Problem Description

Currently I am trying to convert an RDD to a contingency table in order to use the pyspark.ml.clustering.KMeans module, which takes a DataFrame as input.

When I do myrdd.take(K) (where K is some number), the structure looks as follows:

[[u'user1',('itm1',3),...,('itm2',1)], [u'user2',('itm1',7),...,('itm2',4)],...,[u'usern',('itm2',2),...,('itm3',10)]]

Each list contains an entity as the first element, followed by the set of all items that this entity liked, together with their counts, in the form of (item, count) tuples.

Now, my objective is to convert the above into a Spark DataFrame that resembles the following contingency table.

+----------+------+----+-----+
|entity    |itm1  |itm2|itm3 |
+----------+------+----+-----+
|    user1 |     3|   1|    0|
|    user2 |     7|   4|    0|
|    usern |     0|   2|   10|
+----------+------+----+-----+

I have used the df.stat.crosstab method cited in the following link:

Statistical and Mathematical Functions with DataFrames in Apache Spark - 4. Cross Tabulation (Contingency Table)

and it comes close to what I want.

But when there is an additional count field in the tuple, as in ('itm1',3), how do I incorporate (or add) this value 3 into the final result of the contingency table (or entity-item matrix)?
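To make the gap concrete: df.stat.crosstab only counts how often each (entity, item) pair occurs, so an existing cell comes out as 1 rather than the 3 from ('itm1',3). A minimal sketch with hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical flattened data: one row per (entity, item, count) triple.
df = spark.createDataFrame(
    [("user1", "itm1", 3), ("user1", "itm2", 1), ("user2", "itm1", 7)],
    ["entity", "item", "count"],
)

# crosstab counts pair occurrences only; the count column is ignored,
# so the cell for (user1, itm1) comes out as 1, not 3.
df.stat.crosstab("entity", "item").show()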

Of course, I can take the long route: convert the above RDD of lists into a matrix, write it out as a CSV file, and then read it back as a DataFrame.

Is there a simpler way to do it using a DataFrame?

Recommended Answer

Convert the RDD to a PySpark DataFrame using the createDataFrame() method.
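For instance, a minimal sketch of this step, with hypothetical data and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical RDD of (prediction, label) pairs.
pairs = spark.sparkContext.parallelize([(1.0, 0.0), (1.0, 1.0), (0.0, 0.0)])

# createDataFrame accepts an RDD of tuples plus a list of column names.
train_predictions = spark.createDataFrame(pairs, ["prediction", "label_col"])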

Then use the crosstab method, followed by the show method. See the following example:

cf = train_predictions.crosstab("prediction","label_col")

Display it in tabular format:

cf.show()

Output:

+--------------------+----+----+
|prediction_label_col| 0.0| 1.0|
+--------------------+----+----+
|                 1.0| 752|1723|
|                 0.0|1830| 759|
+--------------------+----+----+
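Note that crosstab only counts pair occurrences. To carry an existing count field through to the table, as with the ('itm1',3) tuples in the question, one option is groupBy with pivot and a sum aggregation instead of crosstab. A minimal sketch, assuming the RDD structure shown in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD matching the structure in the question.
myrdd = sc.parallelize([
    [u"user1", ("itm1", 3), ("itm2", 1)],
    [u"user2", ("itm1", 7), ("itm2", 4)],
    [u"usern", ("itm2", 2), ("itm3", 10)],
])

# Flatten each [entity, (item, count), ...] record into (entity, item, count) rows.
rows = myrdd.flatMap(lambda rec: [(rec[0], item, count) for item, count in rec[1:]])
df = spark.createDataFrame(rows, ["entity", "item", "count"])

# Pivot on item and sum the count field; absent (entity, item) pairs become 0.
contingency = df.groupBy("entity").pivot("item").sum("count").fillna(0)
contingency.show()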
