Max and Min of Spark

This article looks at how to retrieve the other columns of the rows matching a MAX/MIN aggregate in Spark SQL.

Question

I am new to Spark and I have some questions about the aggregation functions MAX and MIN in Spark SQL.

In Spark SQL, when I use the MAX/MIN function, only MAX(value)/MIN(value) is returned. But what if I also want the other corresponding columns?

For example, given a DataFrame with columns time, value and label, how can I get the time with the MIN(value), grouped by label?

Thanks.

Answer

You need to first do a groupBy, and then join that back to the original DataFrame. In Scala, it looks like this:

import org.apache.spark.sql.functions.min
import spark.implicits._  // for the $"col" syntax

df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show

I don't use Python, but it would look close to the above.
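To make the groupBy-then-join logic concrete for Python users, here is a minimal pure-Python sketch of the same idea (no Spark required); the time/value/label sample rows are hypothetical:

```python
# Hypothetical sample rows: (time, value, label)
rows = [
    ("09:00", 3.0, "a"),
    ("09:05", 1.0, "a"),
    ("09:10", 5.0, "b"),
    ("09:15", 2.0, "b"),
]

# Step 1: the "groupBy + agg(min)" part - minimum value per label.
min_value = {}
for time, value, label in rows:
    if label not in min_value or value < min_value[label]:
        min_value[label] = value

# Step 2: the "join back" part - keep rows whose (label, value)
# matches the per-label minimum, preserving all other columns.
result = [r for r in rows if r[1] == min_value[r[2]]]
# result keeps one full row per label: the one carrying the min value
```

In actual PySpark the structure mirrors the Scala snippet: a `groupBy`/`agg` producing the per-label minimum, joined back to the original DataFrame on label and value.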

You can even do max() and min() in one pass:

import org.apache.spark.sql.functions.{min, max}
import spark.implicits._  // for the $"col" syntax

df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"),
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")
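The one-pass min-and-max variant can be sketched the same way in plain Python (sample data again hypothetical):

```python
# Hypothetical sample rows: (time, value, label)
rows = [
    ("09:00", 3.0, "a"),
    ("09:05", 1.0, "a"),
    ("09:10", 4.0, "a"),
    ("09:15", 5.0, "b"),
    ("09:20", 2.0, "b"),
    ("09:25", 7.0, "b"),
]

# One pass over the data: track (min_value, max_value) per label.
extremes = {}
for _, value, label in rows:
    lo, hi = extremes.get(label, (value, value))
    extremes[label] = (min(lo, value), max(hi, value))

# Keep rows whose value is either the min or the max of its label
# (the OR condition in the Spark join above).
result = [r for r in rows if r[1] in extremes[r[2]]]
```

Note that, as in the Spark version, a label whose min and max fall on the same row yields that row only once, and ties (several rows sharing the extreme value) all survive the filter.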

