Why agg() in PySpark is only able to summarize one column at a time?


Question


For the dataframe below:

df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

When I try to find the min & max, I only get the min value in the output.

df.agg({'High': 'max', 'High': 'min'}).show()

+---------+
|min(High)|
+---------+
|      4.3|
+---------+

Why can't agg() give both the max & min, like Pandas does?

Solution

As you can see in the PySpark documentation for agg():

agg(*exprs)

Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate functions (string), or a list of Column.
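This is why the dict version in the question drops the max: a plain Python dict cannot hold duplicate keys, so the second 'High' entry silently overwrites the first before agg() ever sees it. A minimal sketch of the collision, runnable in any Python shell:

>>> exprs = {'High': 'max', 'High': 'min'}  # duplicate key: the last value wins
>>> exprs
{'High': 'min'}

The dict form is still fine when each aggregate targets a different column, e.g. df.agg({'High': 'max', 'name': 'count'}).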

To get several aggregates at once, you can instead pass a list of Column expressions, applying whatever function you need to each column, like this:

>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.High), F.max(df.High), F.avg(df.High), F.sum(df.High)).show()
+---------+---------+---------+---------+
|min(High)|max(High)|avg(High)|sum(High)|
+---------+---------+---------+---------+
|      4.3|    7.677|   5.9885|   11.977|
+---------+---------+---------+---------+
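If you want friendlier output column names, or need to build the aggregates programmatically, you can alias each expression and unpack a list into agg(). A small sketch along those lines (the min_High/max_High names are just illustrative choices, not anything agg() requires):

>>> from pyspark.sql import functions as F
>>> # Build the expressions in a list, aliasing each for a readable column name
>>> aggs = [F.min('High').alias('min_High'), F.max('High').alias('max_High')]
>>> df.agg(*aggs).show()
+--------+--------+
|min_High|max_High|
+--------+--------+
|     4.3|   7.677|
+--------+--------+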
