Scala/Spark dataframes: find the column name corresponding to the max

Views: 122 | Published: 2020/9/4 1:26:58 | Tags: scala, apache-spark, dataframe, apache-spark-sql, argmax

Question

In Scala/Spark, given a dataframe:

val dfIn = sqlContext.createDataFrame(Seq(
  ("r0", 0, 2, 3),
  ("r1", 1, 0, 0),
  ("r2", 0, 2, 2))).toDF("id", "c0", "c1", "c2")

I would like to compute a new column, maxCol, holding the name of the column with the maximum value in each row. For this example, the output should be:

+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0|  0|  2|  3|    c2|
| r1|  1|  0|  0|    c0|
| r2|  0|  2|  2|    c1|
+---+---+---+---+------+

In practice, the dataframe has more than 60 columns, so a generic solution is required.

The equivalent in Python pandas (yes, I know, I should compare with PySpark...) could be:

dfOut = pd.concat([dfIn, dfIn.idxmax(axis=1).rename('maxCol')], axis=1)

Recommended answer

With a small trick you can use the greatest function. Required imports:

import org.apache.spark.sql.functions.{col, greatest, lit, struct}

First, let's create a list of structs, where the first element is the value and the second is the column name:

val structs = dfIn.columns.tail.map(
  c => struct(col(c).as("v"), lit(c).as("k"))
)

A structure like this can be passed to greatest as follows:

dfIn.withColumn("maxCol", greatest(structs: _*).getItem("k"))

+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0|  0|  2|  3|    c2|
| r1|  1|  0|  0|    c0|
| r2|  0|  2|  2|    c2|
+---+---+---+---+------+

Please note that in case of ties greatest picks the struct that compares larger. Structs are compared field by field, so when the values are equal the column name decides: (x, "c2") > (x, "c1"). With these names, that is the column that occurs later in the sequence, which is why row r2 yields c2 here rather than c1. If for some reason this is not acceptable, you can explicitly reduce with when:

import org.apache.spark.sql.functions.when

// On ties (>=), keep the left struct, i.e. the column that appears first
val maxCol = structs.reduce(
  (c1, c2) => when(c1.getItem("v") >= c2.getItem("v"), c1).otherwise(c2)
).getItem("k")

dfIn.withColumn("maxCol", maxCol)

+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0|  0|  2|  3|    c2|
| r1|  1|  0|  0|    c0|
| r2|  0|  2|  2|    c1|
+---+---+---+---+------+
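
The difference in tie-breaking between the two approaches can be reproduced without Spark. The following is only an illustrative sketch using plain Scala tuples in place of structs; the names row, viaMax and viaReduce are made up for this example and are not part of the original answer:

```scala
// (value, column name) pairs for row r2 of the example dataframe
val row = Seq((0, "c0"), (2, "c1"), (2, "c2"))

// Analogous to greatest on structs: tuples are compared field by field,
// so when the values are equal the column name breaks the tie
val viaMax = row.max._2

// Analogous to the when-based reduce: >= keeps the left (earlier) pair on ties
val viaReduce = row.reduce((a, b) => if (a._1 >= b._1) a else b)._2

println(s"max: $viaMax, reduce: $viaReduce")
```

Here viaMax is "c2" and viaReduce is "c1", matching the two output tables above.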

In case of nullable columns you have to adjust this, because a comparison involving null evaluates to null, so the when condition is not satisfied and the otherwise branch is taken. One option is to coalesce the values to -Inf.
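
A minimal sketch of that adjustment is shown below. It is not part of the original answer: the helper name nullSafeMaxCol is hypothetical, and the cast to double is an assumption made here so that negative infinity is a valid replacement value (it may lose precision for very large long or decimal values).

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, lit, struct, when}

// Hypothetical helper: builds the maxCol expression for the given value
// columns, treating null as -Infinity so that nulls never win a comparison.
def nullSafeMaxCol(valueCols: Seq[String]): Column = {
  val structs = valueCols.map { c =>
    // Cast to double so that -Infinity can stand in for null
    val v = coalesce(col(c).cast("double"), lit(Double.NegativeInfinity))
    struct(v.as("v"), lit(c).as("k"))
  }
  // Same reduction as above: on ties, keep the column that appears first
  structs.reduce(
    (c1, c2) => when(c1.getItem("v") >= c2.getItem("v"), c1).otherwise(c2)
  ).getItem("k")
}

// Usage, assuming dfIn is the dataframe defined in the question
val dfOut = dfIn.withColumn("maxCol", nullSafeMaxCol(dfIn.columns.tail.toSeq))
```

Note that with this sketch a row in which every value is null reports the first column name. If that is misleading for your data, wrap the result in a further when that checks for this case.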
