PySpark Boolean Pivot
Question
I have some data mimicking the following structure:
rdd = sc.parallelize(
    [
        (0, 1),
        (0, 5),
        (0, 3),
        (1, 2),
        (1, 3),
        (2, 6)
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["group","value"])
df_data.show()
+-----+-----+
|group|value|
+-----+-----+
|    0|    1|
|    0|    5|
|    0|    3|
|    1|    2|
|    1|    3|
|    2|    6|
+-----+-----+
What I would like to do is pivot this data by group to show whether each distinct 'value' is present, as follows:
+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
|    0|   true|  false|   true|   true|  false|
|    1|  false|   true|   true|  false|  false|
|    2|  false|  false|  false|  false|   true|
+-----+-------+-------+-------+-------+-------+
Is there any way I could achieve this with PySpark?
I have tried combinations of groupby/pivot/agg without any success.
Recommended answer
@Psidom's answer will only work on Spark 2.3 and higher, because pyspark.sql.DataFrameNaFunctions (the .na.fill call) did not support bool values in prior versions.
This is what I get when I run that code in Spark 2.1 (the nulls are left unfilled):
import pyspark.sql.functions as F

(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
    .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
    .na.fill(False).show())
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|   null|   true|   true|   null|
#|    1|   null|   true|   true|   null|   null|
#|    2|   null|   null|   null|   null|   true|
#+-----+-------+-------+-------+-------+-------+
Here is an alternative solution that should work for Spark 2.2 and lower:
# first pivot and fill nulls with 0
df = df_data.groupBy('group').pivot('value').count().na.fill(0)
df.show()
#+-----+---+---+---+---+---+
#|group|  1|  2|  3|  5|  6|
#+-----+---+---+---+---+---+
#|    0|  1|  0|  1|  1|  0|
#|    1|  0|  1|  1|  0|  0|
#|    2|  0|  0|  0|  0|  1|
#+-----+---+---+---+---+---+
Now use select to rename the columns and cast the values from int to bool (a count of 0 becomes false, any non-zero count becomes true):
df.select(
    *[F.col(c) if c == 'group' else F.col(c).cast('boolean').alias('value_'+c)
      for c in df.columns]
).show()
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|  false|   true|   true|  false|
#|    1|  false|   true|   true|  false|  false|
#|    2|  false|  false|  false|  false|   true|
#+-----+-------+-------+-------+-------+-------+
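To sanity-check the expected result without a Spark session, the same presence matrix can be modeled in plain Python. This is only a sketch of what the pivot computes, not Spark code: the helper name boolean_pivot is hypothetical, and it assumes the input fits in memory as a list of (group, value) tuples.

```python
from collections import defaultdict


def boolean_pivot(rows):
    """Plain-Python model of the pivot: one row per group, one boolean per distinct value."""
    # Collect the distinct values seen for each group.
    values_by_group = defaultdict(set)
    for group, value in rows:
        values_by_group[group].add(value)

    # The pivot columns are the sorted distinct values across all groups.
    columns = sorted({value for _, value in rows})

    # True where the group contains the value, False otherwise.
    return {
        group: {"value_%d" % v: (v in seen) for v in columns}
        for group, seen in sorted(values_by_group.items())
    }


rows = [(0, 1), (0, 5), (0, 3), (1, 2), (1, 3), (2, 6)]
result = boolean_pivot(rows)
for group, flags in result.items():
    print(group, flags)
```

Comparing this dictionary with the collected rows of the Spark result is one way to confirm that the cast from counts to booleans behaves as intended on a small sample.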