Defining a UDF that accepts an Array of objects in a Spark DataFrame?
Question
When working with Spark's DataFrames, User Defined Functions (UDFs) are required for mapping data in columns. UDFs require that argument types are explicitly specified. In my case, I need to manipulate a column that is made up of arrays of objects, and I do not know what type to use. Here's an example:
import sqlContext.implicits._
// Start with some data. Each row (here, there's only one row)
// is a topic and a bunch of subjects
val data = sqlContext.read.json(sc.parallelize(Seq(
"""
|{
| "topic" : "pets",
| "subjects" : [
| {"type" : "cat", "score" : 10},
| {"type" : "dog", "score" : 1}
| ]
|}
""")))
It's relatively straightforward to use the built-in org.apache.spark.sql.functions to perform basic operations on the data in the columns:
import org.apache.spark.sql.functions.size
data.select($"topic", size($"subjects")).show
+-----+--------------+
|topic|size(subjects)|
+-----+--------------+
| pets| 2|
+-----+--------------+
and it's generally easy to write custom UDFs to perform arbitrary operations:
import org.apache.spark.sql.functions.udf
val enhance = udf { topic : String => topic.toUpperCase() }
data.select(enhance($"topic"), size($"subjects")).show
+----------+--------------+
|UDF(topic)|size(subjects)|
+----------+--------------+
| PETS| 2|
+----------+--------------+
But what if I want to use a UDF to manipulate the array of objects in the "subjects" column? What type do I use for the argument in the UDF? For example, if I want to reimplement the size function, instead of using the one provided by Spark:
val my_size = udf { subjects: Array[Something] => subjects.size }
data.select($"topic", my_size($"subjects")).show
Clearly Array[Something] does not work... what type should I use!? Should I ditch Array[] altogether? Poking around tells me scala.collection.mutable.WrappedArray may have something to do with it, but still there's another type I need to provide.
Answer
What you're looking for is Seq[o.a.s.sql.Row]:
import org.apache.spark.sql.Row
val my_size = udf { subjects: Seq[Row] => subjects.size }
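Used like the built-in size above, this yields the same count. For example (the UDF(subjects) column name follows the UDF(...) naming pattern shown in the question):

data.select($"topic", my_size($"subjects")).show

+-----+-------------+
|topic|UDF(subjects)|
+-----+-------------+
| pets|            2|
+-----+-------------+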
Explanation:
- Current representation of ArrayType is, as you already know, WrappedArray, so Array won't work and it is better to stay on the safe side.
- According to the official specification, the local (external) type for StructType is Row. Unfortunately it means that access to the individual fields is not type safe (see the sketch just below).
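As a minimal sketch of that runtime-checked access, reusing the example data from the question (the field names type and score come from the JSON; getAs throws at runtime if the name or expected type is wrong):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sum each subject's "score". Row field access is resolved only at
// runtime, so a misspelled field or wrong type fails when the UDF runs.
val total_score = udf { subjects: Seq[Row] =>
  subjects.map(_.getAs[Long]("score")).sum
}

data.select($"topic", total_score($"subjects")).show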
Notes:
- To create struct in Spark < 2.3, the function passed to udf has to return a Product type (Tuple* or a case class), not Row (see the first sketch after these notes). That's because the corresponding udf variants depend on Scala reflection:

  Defines a Scala closure of n arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature.
- In Spark >= 2.3 it is possible to return Row directly, as long as the schema is provided (see the second sketch below):

  def udf(f: AnyRef, dataType: DataType): UserDefinedFunction

  Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion.

  See for example How to create a Spark UDF in Java / Kotlin which returns a complex type?
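First, a sketch of the Spark < 2.3 restriction: the struct is created by returning a case class. The Subject class here is hypothetical, mirroring the example data:

// A case class (a Product type) describing the struct to create.
case class Subject(`type`: String, score: Long)

// Spark < 2.3 infers the struct schema by reflection from the
// Product type; returning a Row here would fail instead.
val top_subject = udf { subjects: Seq[Row] =>
  val best = subjects.maxBy(_.getAs[Long]("score"))
  Subject(best.getAs[String]("type"), best.getAs[Long]("score"))
}

data.select($"topic", top_subject($"subjects")).show

Second, a sketch of the Spark >= 2.3 variant quoted above: the UDF returns a Row directly, with the output schema spelled out by hand (the schema below is an assumption matching the example data):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Explicit output schema, required because this udf variant does
// no reflection-based inference.
val subjectSchema = StructType(Seq(
  StructField("type", StringType),
  StructField("score", LongType)
))

// Returns the first subject as a Row; legal in Spark >= 2.3 because
// the DataType is supplied explicitly.
val first_subject = udf(
  (subjects: Seq[Row]) => subjects.head,
  subjectSchema
)

data.select($"topic", first_subject($"subjects")).show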