How to compose column name using another column's value for withColumn in Scala Spark
Question
I'm trying to add a new column to a DataFrame. The value of this column is the value of another column, whose name depends on other columns from the same DataFrame.
For example, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I want to get this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C, whose value comes from either column A_1 or B_2. The name of the source column (e.g. A_1) comes from concatenating the values of columns A and B.
I know that I can add a new column based on another column and a constant, like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
But nothing I've tried works.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
Answer
You can achieve this by writing a udf function. I suggest a udf because your requirement is to process the dataframe row by row, in contrast to the built-in functions, which operate column by column.
But before that, you need the array of column names:
val columns = df.columns
Then write the udf function as:
import scala.collection.mutable
import org.apache.spark.sql.functions._

def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))
where
A is the first column's value
B is the second column's value
array is the array of all the column values
Now just call the udf function using the withColumn API:
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
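Putting the pieces together, here is a minimal, self-contained sketch. The DataFrame is reconstructed to match the example above, and the `local[*]` master and app name are just placeholders for trying it out locally; note the explicit cast of the integer column B to a string before it reaches the udf:

```scala
import scala.collection.mutable

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("compose-column-name").getOrCreate()
import spark.implicits._

// Reconstruct the example DataFrame from the question.
val df = Seq(("A", 1, 0.1, 0.3), ("B", 2, 0.2, 0.4)).toDF("A", "B", "A_1", "B_2")

val columns = df.columns

// Per row: build the name s"${A}_${B}" and pick the matching entry out of
// the array of all column values (array() coerces them to a common type).
val getValue = udf((a: String, b: String, values: mutable.WrappedArray[String]) =>
  values(columns.indexOf(a + "_" + b)))

df.withColumn("C", getValue($"A", $"B".cast("string"), array(columns.map(col): _*))).show(false)
```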
You should get your desired output dataframe.
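As a side note, the same result can also be reached without a udf by chaining one `when` branch per candidate column. This is only a sketch, under the assumption that the candidate source columns are exactly those whose names contain an underscore:

```scala
import org.apache.spark.sql.functions._

// Candidate source columns: assumed here to be the ones named like "X_n".
val candidates = df.columns.filter(_.contains("_"))

// Fold a when(...) chain: pick col(name) when s"${A}_${B}" equals that name.
val cExpr = candidates.foldLeft(lit(null).cast("double")) { (acc, name) =>
  when(concat(col("A"), lit("_"), col("B").cast("string")) === name, col(name)).otherwise(acc)
}

df.withColumn("C", cExpr).show(false)
```

Built-in expressions like this are generally preferable to a udf where they fit, since Catalyst can inspect and optimize them.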