PySpark - How to transpose a DataFrame
Question
I want to transpose a DataFrame. This is just a small excerpt from my original DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per (firm, class) pair; Revenue is the value to spread across columns.
valuesCol = [('22','ABC Ltd','U.K.','class 1',102),('22','ABC Ltd','U.K.','class 2',73),('22','ABC Ltd','U.K.','class 3',92),
             ('51','Eric AB','Sweden','class 1',52),('51','Eric AB','Sweden','class 2',34),('51','Eric AB','Sweden','class 3',11)]
df = spark.createDataFrame(valuesCol,['ID','Firm','Country','Class','Revenue'])
df.show()
+---+-------+-------+-------+-------+
| ID|   Firm|Country|  Class|Revenue|
+---+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|class 1|    102|
| 22|ABC Ltd|   U.K.|class 2|     73|
| 22|ABC Ltd|   U.K.|class 3|     92|
| 51|Eric AB| Sweden|class 1|     52|
| 51|Eric AB| Sweden|class 2|     34|
| 51|Eric AB| Sweden|class 3|     11|
+---+-------+-------+-------+-------+
There is no transpose function in PySpark as such. One way to get the required result is to create three DataFrames, one each for class 1, class 2 and class 3, and then join them (left join). But a join can involve a shuffle over the network, depending on the hash partitioner, and is very costly. I am sure there is a simpler, more elegant way.
Expected output:
+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country| Class1| Class2| Class3|
+---+-------+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|    102|     73|     92|
| 51|Eric AB| Sweden|     52|     34|     11|
+---+-------+-------+-------+-------+-------+
Recommended answer
Courtesy of this link. We have to use an aggregate function while pivoting, because a pivot is always performed in the context of an aggregation. The aggregate function can be sum, count, mean, min or max, depending on the desired output:
df = df.groupBy(["ID","Firm","Country"]).pivot("Class").sum("Revenue")
df.show()
+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country|class 1|class 2|class 3|
+---+-------+-------+-------+-------+-------+
| 51|Eric AB| Sweden|     52|     34|     11|
| 22|ABC Ltd|   U.K.|    102|     73|     92|
+---+-------+-------+-------+-------+-------+