PySpark Boolean Pivot


Problem Description

I have some data mimicking the following structure:

# `sc` and `sqlContext` are assumed to come from the PySpark shell
rdd = sc.parallelize(
    [
        (0,1), 
        (0,5), 
        (0,3), 
        (1,2), 
        (1,3), 
        (2,6)
    ]
)

df_data = sqlContext.createDataFrame(rdd, ["group","value"])

df_data.show()

+-----+-----+
|group|value|
+-----+-----+
|    0|    1|
|    0|    5|
|    0|    3|
|    1|    2|
|    1|    3|
|    2|    6|
+-----+-----+

What I would like to do is to pivot this data by group to show the presence of the 'value' values as follows:

+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
|    0|   true|  false|   true|   true|  false|
|    1|  false|   true|   true|  false|  false|
|    2|  false|  false|  false|  false|   true|
+-----+-------+-------+-------+-------+-------+

Is there any way I could achieve this with PySpark?

I have tried a combination of groupby/pivot/agg without any success.

Recommended Answer

@Psidom's answer will only work on Spark 2.3 and higher, because pyspark.sql.DataFrameNaFunctions did not support bool values in earlier versions.

This is what I get when I run that code in Spark 2.1:

import pyspark.sql.functions as F

# @Psidom's approach: pivot, flag the cells that are present as true, then
# try to fill the missing cells with False; na.fill(False) is a no-op
# before Spark 2.3, so those cells stay null
(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
        .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
        .na.fill(False).show())
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|   null|   true|   true|   null|
#|    1|   null|   true|   true|   null|   null|
#|    2|   null|   null|   null|   null|   true|
#+-----+-------+-------+-------+-------+-------+
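
One way to fill those remaining nulls on Spark 2.1/2.2, without relying on na.fill, is to coalesce each pivoted column with a literal False. This is only a sketch building on the code above; the pivoted name is illustrative:

pivoted = (df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
                  .groupBy('group').pivot('value').agg(F.count('*').isNotNull()))

# coalesce turns each null cell into False without needing na.fill(bool)
pivoted.select(
    'group',
    *[F.coalesce(F.col(c), F.lit(False)).alias(c)
      for c in pivoted.columns if c != 'group']
).show()
# expected output (row order may differ): the desired table above,
# with false in place of every null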


Here is an alternative solution that should work for Spark 2.2 and lower:

# first pivot and fill nulls with 0
df = df_data.groupBy('group').pivot('value').count().na.fill(0)
df.show()
#+-----+---+---+---+---+---+
#|group|  1|  2|  3|  5|  6|
#+-----+---+---+---+---+---+
#|    0|  1|  0|  1|  1|  0|
#|    1|  0|  1|  1|  0|  0|
#|    2|  0|  0|  0|  0|  1|
#+-----+---+---+---+---+---+

Now use select to rename the columns and cast the values from int to bool:

df.select(
    *[F.col(c) if c == 'group' else F.col(c).cast('boolean').alias('value_'+c) 
      for c in df.columns]
).show()
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|  false|   true|   true|  false|
#|    1|  false|   true|   true|  false|  false|
#|    2|  false|  false|  false|  false|   true|
#+-----+-------+-------+-------+-------+-------+
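
As a quick sanity check (a sketch, not from the original answer; result is just an illustrative name), the select can be assigned to a variable and its dtypes inspected to confirm the columns are now boolean:

result = df.select(
    *[F.col(c) if c == 'group' else F.col(c).cast('boolean').alias('value_'+c)
      for c in df.columns]
)
# the value_* columns should now report as boolean
print(result.dtypes)
# e.g. [('group', 'bigint'), ('value_1', 'boolean'), ..., ('value_6', 'boolean')]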

