How to compose column name using another column's value for withColumn in Scala Spark

Question

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.

For example, given this:

+---+---+----+----+
|  A|  B| A_1| B_2|
+---+---+----+----+
|  A|  1| 0.1| 0.3|
|  B|  2| 0.2| 0.4|
+---+---+----+----+

I want to obtain this:

+---+---+----+----+----+
|  A|  B| A_1| B_2|   C|
+---+---+----+----+----+
|  A|  1| 0.1| 0.3| 0.1|
|  B|  2| 0.2| 0.4| 0.4|
+---+---+----+----+----+

That is, I added column C, whose value comes from either column A_1 or B_2. The name of the source column (A_1) comes from concatenating the values of columns A and B.

I know that I can add a new column based on another column and a constant, like this:

df.withColumn("C", $"B" + 1)

I also know that the name of the column can come from a variable like this:

val name = "A_1"
df.withColumn("C", col(name) + 1)

However, what I'd like to do is something like this:

df.withColumn("C", col(s"${col("A")}_${col("B")}"))

But that doesn't work: column names are resolved when the query is analyzed, not per row, so the name cannot be computed from each row's values this way.

NOTE: I'm coding in Scala 2.11 and Spark 2.2.

Answer

You can achieve your requirement by writing a udf function. A udf is the suggestion here because the requirement is to process the dataframe row by row, whereas the built-in functions operate column by column.

But before that, you need the array of column names:

val columns = df.columns

Then write the udf function as:

import scala.collection.mutable
import org.apache.spark.sql.functions._

// For each row, build the source column name as A + "_" + B, find its
// position in the captured `columns` array, and return the value at
// that position.
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))

where

A is the first column's value
B is the second column's value
array is the array of all the column values in the row

Now just call the udf function using the withColumn API:

df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)

You should get your desired output dataframe.
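
For reference, here is a minimal, self-contained sketch of the whole approach, assuming a local SparkSession named spark and the sample data from the question (B is created as a string column so it matches the udf's String parameter):

import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "1", 0.1, 0.3),
  ("B", "2", 0.2, 0.4)
).toDF("A", "B", "A_1", "B_2")

val columns = df.columns

// Per row: resolve the source column name (e.g. "A_1") and pick its value
// out of the array of all column values.
def getValue = udf((a: String, b: String, values: mutable.WrappedArray[String]) =>
  values(columns.indexOf(a + "_" + b)))

df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)

which should print something like:

+---+---+---+---+---+
|A  |B  |A_1|B_2|C  |
+---+---+---+---+---+
|A  |1  |0.1|0.3|0.1|
|B  |2  |0.2|0.4|0.4|
+---+---+---+---+---+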
