Defining a UDF that accepts an Array of objects in a Spark DataFrame?


Question

When working with Spark's DataFrames, User Defined Functions (UDFs) are required for mapping data in columns. UDFs require that argument types are explicitly specified. In my case, I need to manipulate a column that is made up of arrays of objects, and I do not know what type to use. Here's an example:

import sqlContext.implicits._

// Start with some data. Each row (here, there's only one row) 
// is a topic and a bunch of subjects
val data = sqlContext.read.json(sc.parallelize(Seq(
  """
  |{
  |  "topic" : "pets",
  |  "subjects" : [
  |    {"type" : "cat", "score" : 10},
  |    {"type" : "dog", "score" : 1}
  |  ]
  |}
  """.stripMargin)))  // stripMargin removes the leading | so the string is valid JSON

It's relatively straightforward to use the built-in org.apache.spark.sql.functions to perform basic operations on the data in the columns:

import org.apache.spark.sql.functions.size
data.select($"topic", size($"subjects")).show

+-----+--------------+
|topic|size(subjects)|
+-----+--------------+
| pets|             2|
+-----+--------------+

It's also generally easy to write custom UDFs to perform arbitrary operations:

import org.apache.spark.sql.functions.udf
val enhance = udf { topic : String => topic.toUpperCase() }
data.select(enhance($"topic"), size($"subjects")).show 

+----------+--------------+
|UDF(topic)|size(subjects)|
+----------+--------------+
|      PETS|             2|
+----------+--------------+

But what if I want to use a UDF to manipulate the array of objects in the "subjects" column? What type do I use for the argument in the UDF? For example, suppose I want to reimplement the size function instead of using the one provided by Spark:

val my_size = udf { subjects: Array[Something] => subjects.size }
data.select($"topic", my_size($"subjects")).show

Clearly Array[Something] does not work. What type should I use? Should I ditch Array[] altogether? Poking around suggests that scala.collection.mutable.WrappedArray may have something to do with it, but that still leaves the element type to provide.

Recommended answer

What you're looking for is Seq[org.apache.spark.sql.Row]:

import org.apache.spark.sql.Row

val my_size = udf { subjects: Seq[Row] => subjects.size }

Explanation:

  • As you already know, the current representation of ArrayType is WrappedArray, so Array won't work. It is better to stay on the safe side and declare the argument as Seq.
  • According to the official specification, the local (external) type for StructType is Row. Unfortunately, this means that access to the individual fields is not type safe.
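
As a rough illustration of the second point, here is a minimal sketch of reading struct fields inside such a UDF. It assumes Spark 2.2 or later (for SparkSession and for reading JSON from a Dataset[String]). The name totalScore and the local session setup are hypothetical and not part of the original question. Fields are fetched by name with Row.getAs, whose type parameter is not checked at compile time, which is what "not type safe" means in practice.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

// Reuses the shell's session if one exists; otherwise starts a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("seq-row-udf").getOrCreate()
import spark.implicits._

// Same sample row as in the question. JSON integers are inferred as Long.
val json = """{"topic":"pets","subjects":[{"type":"cat","score":10},{"type":"dog","score":1}]}"""
val data = spark.read.json(Seq(json).toDS())

// Hypothetical UDF: sums the "score" field over all structs in the array.
// Asking for the wrong type in getAs only fails at runtime.
val totalScore = udf { subjects: Seq[Row] =>
  subjects.map(_.getAs[Long]("score")).sum
}

val scored = data.select($"topic", totalScore($"subjects").as("total_score"))
val result = scored.as[(String, Long)].collect().toSeq
```

Row also offers positional accessors such as getLong(i), but lookup by name is less brittle here because the field order of a schema inferred from JSON may differ from the order in the input.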

Notes:

  • To create a struct in Spark < 2.3, the function passed to udf has to return a Product type (Tuple* or a case class), not Row. That's because the corresponding udf variants depend on Scala reflection:

    Defines a Scala closure of n arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature.

  • In Spark >= 2.3 it is possible to return Row directly, as long as the output schema is provided:

    def udf(f: AnyRef, dataType: DataType): UserDefinedFunction

    Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion.

    See, for example, How to create a Spark UDF in Java / Kotlin which returns a complex type?
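
The two notes above can be sketched side by side. This is only an illustration under stated assumptions: the names bestAsTuple, pickBest, bestAsRow and bestSchema are hypothetical, the session setup assumes Spark 2.2 or later, and the legacy flag set below is, as far as I know, only needed on Spark 3.x, where the untyped udf(f, dataType) variant is disallowed by default.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Reuses the shell's session if one exists; otherwise starts a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("struct-udf").getOrCreate()
import spark.implicits._

// Same sample row as in the question.
val json = """{"topic":"pets","subjects":[{"type":"cat","score":10},{"type":"dog","score":1}]}"""
val data = spark.read.json(Seq(json).toDS())

// Variant 1: return a Product (here a tuple); the struct type is inferred by reflection.
// The resulting struct fields are named _1 and _2. maxBy throws on an empty array.
val bestAsTuple = udf { subjects: Seq[Row] =>
  val top = subjects.maxBy(_.getAs[Long]("score"))
  (top.getAs[String]("type"), top.getAs[Long]("score"))
}

// Variant 2 (Spark >= 2.3): return a Row and pass the output schema explicitly.
val bestSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("score", LongType)))

// Declared as a plain function value so the udf(f: AnyRef, dataType: DataType)
// overload is the one that applies.
val pickBest: Seq[Row] => Row = { subjects =>
  val top = subjects.maxBy(_.getAs[Long]("score"))
  Row(top.getAs[String]("type"), top.getAs[Long]("score"))
}

// Assumption: Spark 3.x needs this legacy flag for the untyped variant; older versions ignore it.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
val bestAsRow = udf(pickBest, bestSchema)

val tupleDf = data.select(bestAsTuple($"subjects").as("best"))
val viaTuple = tupleDf.select($"best._1", $"best._2").as[(String, Long)].collect().toSeq

val rowDf = data.select(bestAsRow($"subjects").as("best"))
val viaRow = rowDf.select($"best.name", $"best.score").as[(String, Long)].collect().toSeq
```

Returning a case class instead of a tuple in the first variant would give the struct named fields without an explicit schema. The second variant trades that inference for full control over the output type.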
