Pyspark: reshape data without aggregation


Question

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:

columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]

What I want is a contingency table where the second column is split into two new binary columns (value_HIGH_1, value_HIGH_0) that hold the values from the count column - meaning:

columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
    (1, 21, 141),
    (0, 12, 140)
]

Answer

You can use pivot with a dummy max aggregation (since you have only one element for each group):

import pyspark.sql.functions as F

# Pivot on value_HIGH; max() is effectively a no-op here because each
# (FAULTY, value_HIGH) pair holds exactly one count.
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
|     0|          12|         140|
|     1|          21|         141|
+------+------------+------------+
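The same reshape can be sketched in plain Python to see why no real aggregation happens: each (FAULTY, value_HIGH) pair maps to exactly one count, so the pivot just rearranges values into columns. This is an illustration of the logic, not part of the pyspark answer, and the variable names are my own:

```python
# Input rows from the question: (FAULTY, value_HIGH, count)
rows = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12),
]

# Group by FAULTY and spread value_HIGH into a small dict per group;
# no value is ever combined with another, mirroring the "dummy" max.
pivoted = {}
for faulty, value_high, count in rows:
    pivoted.setdefault(faulty, {})[value_high] = count

# Emit (FAULTY, value_HIGH_1, value_HIGH_0), matching the desired table.
table = [
    (faulty, cols[1], cols[0])
    for faulty, cols in sorted(pivoted.items(), reverse=True)
]
print(table)  # [(1, 21, 141), (0, 12, 140)]
```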

