How to compute the largest value in a column using withColumn?
Problem Description
I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:
val df = sc.parallelize(Seq(1,2,3)).toDF("id")
A first approach would be to select the maximum value, and it works as expected:
df.select(max($"id")).show
The second approach could be to use withColumn as follows:
df.withColumn("max", max($"id")).show
But unfortunately it fails with the following error message:
org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
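For context, the analyzer rejects an aggregate expression that appears next to a plain, non-grouped column, which is exactly what withColumn produces here; a global aggregation, for example, passes (a small sketch, equivalent to the select above):

// A global aggregation with an empty groupBy is accepted, because no
// plain column sits next to the aggregate; equivalent to df.select(max($"id")).
df.groupBy().agg(max($"id")).show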
How can I compute the maximum value in a withColumn function without any Window or groupBy? If not possible, how can I do it in this specific case using a Window?
The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it doesn't require a shuffle to a single partition and remains applicable to large datasets.
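For reference, the window-based version mentioned above might look like the sketch below; with no partitioning column, Spark moves all rows into a single partition, which is exactly why it doesn't scale:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// For reference only: an empty window spec forces every row into one partition.
df.withColumn("max", max($"id").over(Window.partitionBy())).show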
It could be done with withColumn using a separate action:
import org.apache.spark.sql.functions.{lit, max}
df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
but it is much cleaner to use either an explicit cross join:
import org.apache.spark.sql.functions.broadcast
df.crossJoin(broadcast(df.agg(max($"id") as "max")))
or an implicit cross join:
spark.conf.set("spark.sql.crossJoin.enabled", true)
df.join(broadcast(df.agg(max($"id") as "max")))
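Either variant pairs every row of df with the broadcast single-row aggregate, so with the example data the result should look roughly like this:

df.crossJoin(broadcast(df.agg(max($"id") as "max"))).show
// +---+---+
// | id|max|
// +---+---+
// |  1|  3|
// |  2|  3|
// |  3|  3|
// +---+---+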