pyspark: groupby and aggregate avg and first on multiple columns


Question

I have the following sample pyspark dataframe. After a groupby I want to calculate the mean of some columns and the first value of others. In my real case I have 100s of columns, so I can't do it column by column.

sp = spark.createDataFrame([['a',2,4,'cc','anc'], ['a',4,7,'cd','abc'], ['b',6,0,'as','asd'],
                            ['b',2,4,'ad','acb'], ['c',4,4,'sd','acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  a|   2|   4|  cc| anc|
|  a|   4|   7|  cd| abc|
|  b|   6|   0|  as| asd|
|  b|   2|   4|  ad| acb|
|  c|   4|   4|  sd| acc|
+---+----+----+----+----+

Here is what I'm trying:

mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
sp.groupby('id').agg(*[f.mean for col in mean_cols], *[f.first for col in first_cols])

but it's not working. How can I do this with pyspark?

Answer

The best way to apply multiple functions to multiple columns is the .agg(*expr) pattern: build a list of aliased column expressions and unpack it into a single agg() call.

import pyspark.sql.functions as F

# Test data
tst = spark.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],
                            schema=['col1','col2','col3','col4'])
# One aliased expression per (function, column) pair, e.g. min_col1, max_col2, ...
fn_l = [F.min, F.max, F.mean, F.first]
col_l = ['col1', 'col2', 'col3']
expr = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_l for coln in col_l]
tst_r = tst.groupby('col4').agg(*expr)

The result is:

tst_r.show()
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
|col4|min_col1|min_col2|min_col3|max_col1|max_col2|max_col3|mean_col1|mean_col2|mean_col3|first_col1|first_col2|first_col3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
|   5|       5|       6|       7|       7|       8|       9|      6.0|      7.0|      8.0|         5|         6|         7|
|   4|       1|       2|       3|       3|       4|       5|      2.0|      3.0|      4.0|         1|         2|         3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+

To apply different functions to different sets of columns, build multiple expression lists and concatenate them in the aggregation.

# min/max over one set of columns, mean/first over another
fn_1 = [F.min, F.max]
fn_2 = [F.mean, F.first]
col_1 = ['col1', 'col2']
col_2 = ['col1', 'col3', 'col4']
expr1 = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_1 for coln in col_1]
expr2 = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_2 for coln in col_2]
tst_r = tst.groupby('col4').agg(*(expr1 + expr2))
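
Applied back to the question's sp dataframe, a minimal sketch of the same pattern would look like the following. The automatic derivation of first_cols is my assumption (everything that is neither the grouping key nor a mean column gets first()), which is what makes this scale to 100s of columns:

import pyspark.sql.functions as F

mean_cols = ['col1', 'col2']
# Assumption: derive the remaining columns instead of listing them by hand
first_cols = [c for c in sp.columns if c != 'id' and c not in mean_cols]

expr = [F.mean(c).alias('mean_' + c) for c in mean_cols] + \
       [F.first(c).alias('first_' + c) for c in first_cols]
sp.groupby('id').agg(*expr).show()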

