如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json [英] How to query nested json with internal arrays in Spark on basis of equality check

查看：25 发布时间：2021/11/14 23:28:20 json nested apache-spark-sql

本文介绍了如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个嵌套的 json 结构加载到 spark 中的数据帧中.它包含多层数组，我正试图弄清楚如何通过内部数组中的值来查询这个结构.

示例:考虑以下结构(directors.json 文件)

<预><代码>[{导演":史蒂文·斯皮尔伯格"，电影":[{"name": "E.T",演员":[亨利·托马斯"、德鲁·巴里摩尔"]},{"name": "The Goonies",演员":[肖恩·阿斯汀"、乔什·布洛林"]}]},{导演":昆汀塔伦蒂诺"，电影":[{"name": "低俗小说",演员":[约翰·特拉沃尔塔"、塞缪尔·杰克逊"]},{"name": "杀死比尔:第 1 卷",演员":[乌玛·瑟曼"、达里尔·汉娜"]}]}]

假设我想运行一个查询，该查询将返回特定演员参与的所有电影.类似这样:

val 董事 = spark.read.json("directors.json")Director.select($"films.name").where($"films.actors" === "Henry Thomas")

当我在 spark shell 中运行它时，出现异常:

org.apache.spark.sql.AnalysisException:由于数据类型不匹配，无法解析 '(`films`.`actors` = 'Henry Thomas')':'(`films`.` 中的类型不同)actor` = 'Henry Thomas')' (array<array<string>> and string).;;'项目 [名称#128]+- '过滤器 (films#92.actors = Henry Thomas)+- 分析屏障+- 项目 [电影#92.name AS name#128，电影#92]+- 关系[导演#91，电影#92] json

我该如何正确地进行这样的查询?

有不同的选择吗?如果是，利弊是什么?

谢谢

编辑

@thebluephantom 这仍然不起作用.得到类似的异常.我认为这是因为我在另一个数组中有一个数组.这是个例外:

org.apache.spark.sql.AnalysisException:由于数据类型不匹配，无法解析array_contains(`films`.`actors`, 'Henry Thomas')':参数必须是一个数组，后跟一个值与数组成员类型相同；'过滤器 array_contains(films#7.actors, Henry Thomas)+- 分析屏障+- 项目 [导演#6，电影#7]+- 关系[导演#6，电影#7] json

解决方案

尝试类似的方法，必须分解电影数据，这意味着重复的演员组很简单 - 否则我也无法让它工作 - 也许其他人可以:

使用 SPARK 2.3.1 更完整，如下所示:

val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/FileStore/tables/films.txt")val flattened = df.select($"director",explode($"films").as("films_flat"))flattened.select ("*").where (array_contains (flattened("films_flat.actors"), "Henry Thomas")).show(false)

 +----------------+-----------------------------------+|导演 |films_flat |+----------------+-------------------------------------+|史蒂文·斯皮尔伯格|[[亨利·托马斯，德鲁·巴里摩尔]，E.T]|+----------------+-------------------------------------+

I have a nested json structure loaded into a dataframe in spark. It contains several layers of arrays and I'm trying to figure out how to query this structure by values in the internal arrays.

Example: consider the following structure (directors.json file)

[
  {
    "director": "Steven Spielberg",
    "films": [
      {
        "name": "E.T",
        "actors": ["Henry Thomas", "Drew Barrymore"]
      },
      {
        "name": "The Goonies",
        "actors": ["Sean Astin", "Josh Brolin"]
      }
    ]
  },
  {
    "director": "Quentin Tarantino",
    "films": [
      {
        "name": "Pulp Fiction",
        "actors": ["John Travolta", "Samuel L. Jackson"]
      },
      {
        "name": "Kill Bill: Vol. 1",
        "actors": ["Uma Thurman", "Daryl Hannah"]
      }
    ]
  }
]

Lets say I want to run a query that will return all the films that a specific actor has participated in. something like this:

val directors = spark.read.json("directors.json")
directors.select($"films.name").where($"films.actors" === "Henry Thomas")

When I run this in the spark shell I get an exception:

org.apache.spark.sql.AnalysisException: cannot resolve '(`films`.`actors` = 'Henry Thomas')' due to data type mismatch: differing types in '(`films`.`actors` = 'Henry Thomas')' (array<array<string>> and string).;;
'Project [name#128]
+- 'Filter (films#92.actors = Henry Thomas)
   +- AnalysisBarrier
         +- Project [films#92.name AS name#128, films#92]
            +- Relation[director#91,films#92] json

How do I properly make such a query?

Are there different alternatives? If So, what are the pros and cons?

Thanks

Edit

@thebluephantom this still doesn't work. getting similar exception. I think it's because I have an array within another array. This is the exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(`films`.`actors`, 'Henry Thomas')' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members;;
'Filter array_contains(films#7.actors, Henry Thomas)
+- AnalysisBarrier
      +- Project [director#6, films#7]
         +- Relation[director#6,films#7] json

解决方案

Try something similar to this whereby the film data must be exploded which means the repeating group of actors is simple normalized - otherwise I cannot get it to work either - maybe someone else can:

More complete using SPARK 2.3.1 as follows with your data:

val df = spark.read
   .option("multiLine", true).option("mode", "PERMISSIVE")
   .json("/FileStore/tables/films.txt")

val flattened = df.select($"director", explode($"films").as("films_flat"))
flattened.select ("*").where (array_contains (flattened("films_flat.actors"), "Henry Thomas")).show(false)

returns:

 +----------------+-------------------------------------+
 |director        |films_flat                           |
 +----------------+-------------------------------------+
 |Steven Spielberg|[[Henry Thomas, Drew Barrymore], E.T]|
 +----------------+-------------------------------------+

这篇关于如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json [英] How to query nested json with internal arrays in Spark on basis of equality check

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json [英] How to query nested json with internal arrays in Spark on basis of equality check

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭