PySpark - How to transpose a Dataframe


Problem description

I want to transpose a dataframe. This is just a small excerpt from my original dataframe -

from pyspark.sql import SparkSession

# the original snippet imported to_timestamp/date_format (unused) and relied on
# a pre-existing sqlContext; a SparkSession makes the example self-contained
spark = SparkSession.builder.getOrCreate()
valuesCol = [('22','ABC Ltd','U.K.','class 1',102),('22','ABC Ltd','U.K.','class 2',73),('22','ABC Ltd','U.K.','class 3',92),
             ('51','Eric AB','Sweden','class 1',52),('51','Eric AB','Sweden','class 2',34),('51','Eric AB','Sweden','class 3',11)]
df = spark.createDataFrame(valuesCol,['ID','Firm','Country','Class','Revenue'])
df.show()
+---+-------+-------+-------+-------+
| ID|   Firm|Country|  Class|Revenue|
+---+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|class 1|    102|
| 22|ABC Ltd|   U.K.|class 2|     73|
| 22|ABC Ltd|   U.K.|class 3|     92|
| 51|Eric AB| Sweden|class 1|     52|
| 51|Eric AB| Sweden|class 2|     34|
| 51|Eric AB| Sweden|class 3|     11|
+---+-------+-------+-------+-------+

There is no transpose function in PySpark as such. One way to achieve the requisite result is to create three dataframes, one each for class 1, class 2 and class 3, and then left-join them. But that could involve a shuffle over the network, depending on the hash partitioner, and would be very costly. I am sure there must be a more elegant and simple way.

Desired output:

+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country| Class1| Class2| Class3|
+---+-------+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|    102|     73|     92|
| 51|Eric AB| Sweden|     52|     34|     11|
+---+-------+-------+-------+-------+-------+


Recommended answer

Courtesy of this link. We have to use an aggregate function while pivoting, because a pivot is always performed in the context of an aggregation. The aggregation function can be sum, count, mean, min or max, depending on the desired output -

df = df.groupBy(["ID","Firm","Country"]).pivot("Class").sum("Revenue")
df.show()
+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country|class 1|class 2|class 3|
+---+-------+-------+-------+-------+-------+
| 51|Eric AB| Sweden|     52|     34|     11|
| 22|ABC Ltd|   U.K.|    102|     73|     92|
+---+-------+-------+-------+-------+-------+
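Note that the pivoted headers are the raw distinct Class values (class 1, class 2, class 3), not the Class1/Class2/Class3 headers shown in the desired output, so a rename pass is still needed. A minimal sketch using a hypothetical clean_name helper (my own naming, not part of PySpark):

```python
def clean_name(name):
    """Turn a pivoted label such as 'class 1' into a header such as 'Class1';
    key columns like 'ID' are returned unchanged."""
    if name.lower().startswith('class'):
        return name.title().replace(' ', '')
    return name

# applied to the pivoted frame's columns, e.g. df.toDF(*[clean_name(c) for c in df.columns])
print([clean_name(c) for c in ['ID', 'Firm', 'Country', 'class 1', 'class 2', 'class 3']])
# → ['ID', 'Firm', 'Country', 'Class1', 'Class2', 'Class3']
```

As a side note, passing the expected values explicitly, e.g. `pivot("Class", ["class 1", "class 2", "class 3"])`, spares Spark the extra pass it otherwise needs to determine the distinct pivot values.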
