Is there a Spark built-in that flattens nested arrays?

Question

I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.

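import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {

    // Remove one level of nesting; a null input becomes an empty sequence.
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    // Build the UDF once, then return a function usable with DataFrame.transform.
    val flattenUdf = udf(flatfunc _)
    (df: DataFrame) => df.withColumn(outCol, flattenUdf(col(inCol)))
}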

My use case is strings, but this could obviously be made generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))

Is there a Spark built-in function that does the same thing? I have not been able to find one.

Recommended answer

There is a similar function (since Spark 2.4), and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the Scaladoc:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Since: 2.4.0
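To illustrate the "only one level of nesting is removed" behavior, here is a minimal sketch meant for spark-shell or a Scala script. The local SparkSession, the column names, and the sample data are assumptions made up for this illustration (spark-shell already provides `spark`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, flatten}

// Assumption: a throwaway local session (Spark 2.4 or later) for the demo.
val spark = SparkSession.builder().master("local[1]").appName("flatten-demo").getOrCreate()
import spark.implicits._

// Hypothetical data: array<array<string>> flattens to array<string>.
val twoLevels = Seq(Seq(Seq("a", "b"), Seq("c"))).toDF("sentences")
val words = twoLevels.select(flatten(col("sentences"))).as[Seq[String]].head()

// Hypothetical data: array<array<array<int>>> loses only its outer level,
// so the result is still an array of arrays.
val threeLevels = Seq(Seq(Seq(Seq(1), Seq(2)), Seq(Seq(3)))).toDF("nested")
val oneLevelLess = threeLevels.select(flatten(col("nested"))).as[Seq[Seq[Int]]].head()
```

Flattening deeper structures completely would take one flatten call per extra level.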

To get the exact equivalent of the UDF above, you'll have to wrap the result in coalesce to replace NULL with an empty array.
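One way this could look is sketched below. The name combineSentencesBuiltin, the local SparkSession, and the sample rows are hypothetical; using typedLit(Seq.empty[String]) as the fallback value is one choice among several for producing an empty string array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// Hypothetical built-in counterpart of combineSentences: flatten the column,
// then substitute an empty array wherever flatten yields NULL.
def combineSentencesBuiltin(inCol: String, outCol: String): DataFrame => DataFrame =
  (df: DataFrame) =>
    df.withColumn(outCol, coalesce(flatten(col(inCol)), typedLit(Seq.empty[String])))

// Assumption: a throwaway local session (Spark 2.4 or later) for the demo.
val spark = SparkSession.builder().master("local[1]").appName("coalesce-flatten").getOrCreate()
import spark.implicits._

// Hypothetical data: the second row has a NULL in the nested-array column.
val df = Seq[(Int, Seq[Seq[String]])](
  (1, Seq(Seq("a"), Seq("b", "c"))),
  (2, null)
).toDF("id", "sentences")

val out = df
  .transform(combineSentencesBuiltin("sentences", "words"))
  .orderBy("id")
  .select("words")
  .as[Seq[String]]
  .collect()
```

Here the row with a NULL column ends up with an empty array instead of NULL, which matches the `case null` branch of the original UDF.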
