How could I order by sum, within a DataFrame in PySpark?
Question
Analogously to:
order_items.groupBy("order_item_order_id").count().orderBy(desc("count")).show()
I tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("sum")).show()
but this throws an error:
Py4JJavaError: An error occurred while calling o501.sort. : org.apache.spark.sql.AnalysisException: cannot resolve 'sum' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I also tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal)")).show()
but I get the same error:
Py4JJavaError: An error occurred while calling o512.sort. : org.apache.spark.sql.AnalysisException: cannot resolve 'SUM(order_item_subtotal)' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I get the right result when executing:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal#429)")).show()
but this was done a posteriori, after having seen the number that Spark appends to the sum column name, i.e. #429.
Is there a way to get the same result but a priori, without knowing which number will be appended?
Recommended answer
You should use an alias for the column:
import pyspark.sql.functions as func

# alias() gives the aggregate a stable, predictable column name,
# so orderBy can reference it directly.
(order_items
    .groupBy("order_item_order_id")
    .agg(func.sum("order_item_subtotal").alias("sum_column_name"))
    .orderBy("sum_column_name"))