Spark - pass full row to a udf and then get column name inside udf


Question


I am using Spark with Scala and want to pass the entire row to a udf, then pick up each column name and column value inside the udf. How can I do this?

I am trying something like this:

inputDataDF.withColumn("errorField", mapCategory(ruleForNullValidation)(col(_*)))

def mapCategory(categories: Map[String, Boolean]) = {
  udf((input: Row) =>
    // write a recursive function: for each column, check whether it is in
    // categories; if yes, check for null (null => false); repeat for all
    // columns and then combine the results
    ???
  )
}

Answer


In Spark 1.6 you can use Row as the external type and struct as the expression. The column names can be fetched from the schema. For example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct}

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
val f = udf((row: Row) => row.schema.fieldNames)
df.select(f(struct(df.columns map col: _*))).show

// +-----------------------------------------------------------------------------+
// |UDF(named_struct(NamePlaceholder, a, NamePlaceholder, b, NamePlaceholder, c))|
// +-----------------------------------------------------------------------------+
// |                                                                    [a, b, c]|
// +-----------------------------------------------------------------------------+


Values can then be accessed by name inside the udf using the Row.getAs method.
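Putting the pieces together, here is a hedged sketch of the mapCategory idea from the question. The names ruleForNullValidation, errorField, and inputDataDF come from the question; the actual validation logic (an entry name -> true means the column must not be null) is an assumption about the intent:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Assumption: `categories` maps a column name to whether that column
// must be non-null (true = a null value is an error).
def mapCategory(categories: Map[String, Boolean]) =
  udf((row: Row) =>
    // Walk the struct's schema, keeping the names of columns that
    // are required but null in this row.
    row.schema.fieldNames.filter { name =>
      categories.getOrElse(name, false) && row.isNullAt(row.fieldIndex(name))
    }.toSeq
  )

// Hypothetical rule set for illustration.
val ruleForNullValidation = Map("a" -> true, "b" -> true, "c" -> false)

val withErrors = inputDataDF.withColumn(
  "errorField",
  mapCategory(ruleForNullValidation)(struct(inputDataDF.columns.map(col): _*))
)
```

This returns the failing column names per row rather than a single boolean, which is usually more useful for debugging; collapsing the Seq to `.isEmpty` recovers a pass/fail flag.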
