Spark - pass full row to a udf and then get column name inside udf
Question
I am using Spark with Scala and want to pass the entire row to a udf, and then select each column name and column value inside the udf. How can I do this?
Here is what I am trying:
inputDataDF.withColumn("errorField", mapCategory(ruleForNullValidation)(struct(inputDataDF.columns map col: _*)))
def mapCategory(categories: Map[String, Boolean]) = {
  // write a recursive function to check if each column is in categories;
  // if yes, check for null; if null then false; repeat for all columns and combine the results
  udf((input: Row) => ???)
}
Answer
In Spark 1.6 you can use Row as the external type and struct as the expression. Column names can be fetched from the schema. For example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct}
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
val f = udf((row: Row) => row.schema.fieldNames)
df.select(f(struct(df.columns map col: _*))).show
// +-----------------------------------------------------------------------------+
// |UDF(named_struct(NamePlaceholder, a, NamePlaceholder, b, NamePlaceholder, c))|
// +-----------------------------------------------------------------------------+
// |                                                                    [a, b, c]|
// +-----------------------------------------------------------------------------+
Values can be accessed by name using the Row.getAs method.
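Putting the pieces together, the null-validation idea from the question could be sketched as follows. This is only a sketch under assumptions from the question: ruleForNullValidation (mapping a column name to whether it must be non-null) and inputDataDF are taken from the asker's snippet, not from a tested codebase. It uses Row.fieldIndex and Row.isNullAt rather than getAs, because getAs on a null value simply returns null.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Assumed rule map from the question: true means the column must not be null
val ruleForNullValidation = Map("a" -> true, "b" -> false, "c" -> true)

def mapCategory(categories: Map[String, Boolean]) =
  udf((row: Row) =>
    row.schema.fieldNames
      .filter(name => categories.getOrElse(name, false))   // only columns with a non-null rule
      .filter(name => row.isNullAt(row.fieldIndex(name)))  // keep the columns that violate it
      .mkString(",")                                       // e.g. "a,c" lists the failing columns
  )

// Pass every column as one struct so the udf sees the full row with its schema
val withErrors = inputDataDF.withColumn(
  "errorField",
  mapCategory(ruleForNullValidation)(struct(inputDataDF.columns map col: _*))
)
```

An empty errorField then means the row passed validation; a non-empty value names the offending columns, which can be filtered on afterwards.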