How to find the max value of multiple columns?

Problem description

I am trying to find the row-wise maximum across multiple columns in a Spark DataFrame. Each column holds a value of type double.

The DataFrame looks like this:

+-----+---+----+---+---+
|Name | A | B  | C | D |
+-----+---+----+---+---+
|Alex |5.1|-6.2|  7|  8|
|John |  7| 8.3|  1|  2|
|Alice|  5|  46|  3|  2|
|Mark |-20| -11|-22| -5|
+-----+---+----+---+---+

The expected output is:

+-----+---+----+---+---+----------+
|Name | A | B  | C | D | MaxValue |
+-----+---+----+---+---+----------+
|Alex |5.1|-6.2|  7|  8|     8    |
|John |  7| 8.3|  1|  2|     8.3  | 
|Alice|  5|  46|  3|  2|     46   |
|Mark |-20| -11|-22| -5|     -5   |
+-----+---+----+---+---+----------+

Recommended answer

You can apply greatest to the list of numeric columns, as shown below:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Alex", 5.1, -6.2, 7.0, 8.0),
  ("John", 7.0, 8.3, 1.0, 2.0),
  ("Alice", 5.0, 46.0, 3.0, 2.0),
  ("Mark", -20.0, -11.0, -22.0, -5.0)
).toDF("Name", "A", "B", "C", "D")

val numCols = df.columns.tail  // Apply suitable filtering as needed (*)

df.withColumn("MaxValue", greatest(numCols.head, numCols.tail: _*)).
  show
// +-----+-----+-----+-----+----+--------+
// | Name|    A|    B|    C|   D|MaxValue|
// +-----+-----+-----+-----+----+--------+
// | Alex|  5.1| -6.2|  7.0| 8.0|     8.0|
// | John|  7.0|  8.3|  1.0| 2.0|     8.3|
// |Alice|  5.0| 46.0|  3.0| 2.0|    46.0|
// | Mark|-20.0|-11.0|-22.0|-5.0|    -5.0|
// +-----+-----+-----+-----+----+--------+

(*) For example, to select all top-level DoubleType columns:

import org.apache.spark.sql.types._

val numCols = df.schema.fields.collect{
  case StructField(name, DoubleType, _, _) => name
}
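
The two pieces above can be combined into one step. The following is a minimal sketch, assuming a local SparkSession; the object name MaxOfDoubleColumns and the helper withMaxValue are hypothetical names introduced for illustration and are not part of the original answer. Because greatest needs at least two input columns, the sketch checks the column count first.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.greatest
import org.apache.spark.sql.types.{DoubleType, StructField}

object MaxOfDoubleColumns {
  // Hypothetical helper: appends a column holding the row-wise maximum
  // of all top-level DoubleType columns of `df`.
  def withMaxValue(df: DataFrame, outCol: String = "MaxValue"): DataFrame = {
    // Collect the names of the top-level columns whose type is DoubleType.
    val numCols = df.schema.fields.collect {
      case StructField(name, DoubleType, _, _) => name
    }
    // greatest requires at least two columns, so fail early with a clear message.
    require(numCols.length >= 2,
      s"need at least two DoubleType columns, found ${numCols.length}")
    df.withColumn(outCol, greatest(numCols.head, numCols.tail: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-of-double-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("Alex", 5.1, -6.2, 7.0, 8.0),
      ("John", 7.0, 8.3, 1.0, 2.0)
    ).toDF("Name", "A", "B", "C", "D")

    withMaxValue(df).show()
    spark.stop()
  }
}
```

Selecting by type rather than by position means the Name column is skipped automatically, so the helper does not depend on the column order.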

If you're on Spark 2.4+, an alternative is array_max, although it involves an additional transformation step in this case, since the columns must first be combined into a single array column:

df.withColumn("MaxValue", array_max(array(numCols.map(col): _*)))
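
To make the extra step visible, the sketch below splits the one-liner into its two stages and puts the result next to greatest for comparison. It assumes Spark 2.4+ and a local SparkSession; the object name ArrayMaxExample, the helper addMaxColumns, and the column names Values, MaxViaArray and MaxViaGreatest are hypothetical and chosen only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, array_max, col, greatest}

object ArrayMaxExample {
  // Hypothetical helper: computes the row-wise maximum of `numCols` twice,
  // once via array + array_max and once via greatest.
  def addMaxColumns(df: DataFrame, numCols: Seq[String]): DataFrame =
    df
      // Step 1: pack the numeric columns into one array column per row.
      .withColumn("Values", array(numCols.map(col): _*))
      // Step 2: reduce that array to its maximum element.
      .withColumn("MaxViaArray", array_max(col("Values")))
      // For comparison: greatest works on the columns directly.
      .withColumn("MaxViaGreatest", greatest(numCols.head, numCols.tail: _*))
      .drop("Values")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-max-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("Alice", 5.0, 46.0, 3.0, 2.0),
      ("Mark", -20.0, -11.0, -22.0, -5.0)
    ).toDF("Name", "A", "B", "C", "D")

    addMaxColumns(df, df.columns.tail.toSeq).show()
    spark.stop()
  }
}
```

For this input both new columns hold 46.0 for Alice and -5.0 for Mark. When the columns already exist separately, greatest avoids building the intermediate array; array_max is the more natural fit when the data already arrives as an array column.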
