How to compute the largest value in a column using withColumn?

Question

I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:

val df = sc.parallelize(Seq(1,2,3)).toDF("id")

A first approach would be to select the maximum value, and it works as expected:

df.select(max($"id")).show

The second approach could be to use withColumn as follows:

df.withColumn("max", max($"id")).show

But unfortunately it fails with the following error message:

org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How can I compute the maximum value in a withColumn function without any Window or groupBy? If not possible, how can I do it in this specific case using a Window?

Solution

The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it does not require a shuffle to a single partition and is applicable to large datasets.

It can be done with withColumn, using a separate action to compute the aggregate first:

import org.apache.spark.sql.functions.{lit, max}

df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))

But it is much cleaner to use either an explicit cross join:

import org.apache.spark.sql.functions.broadcast

df.crossJoin(broadcast(df.agg(max($"id") as "max")))

or an implicit cross join:

spark.conf.set("spark.sql.crossJoin.enabled", true)

df.join(broadcast(df.agg(max($"id") as "max")))
