Spark SQL - Nested array conditional select


Problem description

I have a Spark SQL question. I'd appreciate some guidance on the best way to do a conditional select from a nested array of structs.

I have an example JSON document below:

```
{
   "id":"p1",
   "externalIds":[
      {"system":"a","id":"1"},
      {"system":"b","id":"2"},
      {"system":"c","id":"3"}
   ]
}
```

In Spark SQL I want to select the "id" of one of the array structs based on some conditional logic.

For example, for the document above, select the id field of the array sub-element that has "system" = "b", namely the id "2".

How best to do this in Spark SQL?

Cheers and thanks!

Recommended answer

Using a UDF, this could look like the following, given a DataFrame (all attributes of type String):

```
+---+---------------------+
|id |externalIds          |
+---+---------------------+
|p1 |[[a,1], [b,2], [c,3]]|
+---+---------------------+
```
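
For reference, such a DataFrame could be produced by reading the JSON document from the question. The sketch below is only illustrative: the file name input.json and the local session settings are assumptions, and an explicit schema is supplied so the struct fields stay in (system, id) order, matching the positional access in the UDF defined next.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("nested-array-select")
  .master("local[*]")
  .getOrCreate()

// Explicit schema keeps externalIds elements as struct<system, id> in that order,
// so getString(0) is "system" and getString(1) is "id" in the UDF below.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("externalIds", ArrayType(StructType(Seq(
    StructField("system", StringType),
    StructField("id", StringType)
  ))))
))

// "input.json" is a hypothetical file holding the document from the question
val df = spark.read.schema(schema).json("input.json")
df.show(false)
```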

Define a UDF to traverse your array and find the desired element:

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def getExternal(system: String) = {
  udf((row: Seq[Row]) =>
    row.map(r => (r.getString(0), r.getString(1)))   // (system, id) pairs
      .find { case (s, _) => s == system }           // first element with matching system
      .map(_._2)                                     // keep its id; None (null) if no match
  )
}
```

And use it like this:

```
import spark.implicits._  // for the $"..." column syntax

df
  .withColumn("external", getExternal("b")($"externalIds"))
  .show(false)
```

```
+---+---------------------+--------+
|id |externalIds          |external|
+---+---------------------+--------+
|p1 |[[a,1], [b,2], [c,3]]|2       |
+---+---------------------+--------+
```
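
As a side note, on Spark 2.4 or later the same lookup can also be expressed without a UDF, using the built-in filter higher-order function inside a SQL expression. A minimal sketch, assuming the same df as above:

```
import org.apache.spark.sql.functions.expr

// Keep only the array elements whose system field is "b", then take the id of the
// first match; the expression yields null when no element matches.
df
  .withColumn("external", expr("filter(externalIds, x -> x.system = 'b')[0].id"))
  .show(false)
```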

