如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json [英] How to query nested json with internal arrays in Spark on basis of equality check
问题描述
我有一个嵌套的 json 结构加载到 spark 中的数据帧中.它包含多层数组,我正试图弄清楚如何通过内部数组中的值来查询这个结构.
示例:考虑以下结构(directors.json 文件)
<预><代码>[{导演":史蒂文·斯皮尔伯格",电影":[{"name": "E.T",演员":[亨利·托马斯"、德鲁·巴里摩尔"]},{"name": "The Goonies",演员":[肖恩·阿斯汀"、乔什·布洛林"]}]},{导演":昆汀塔伦蒂诺",电影":[{"name": "低俗小说",演员":[约翰·特拉沃尔塔"、塞缪尔·杰克逊"]},{"name": "杀死比尔:第 1 卷",演员":[乌玛·瑟曼"、达里尔·汉娜"]}]}]假设我想运行一个查询,该查询将返回特定演员参与的所有电影.类似这样:
val 董事 = spark.read.json("directors.json")Director.select($"films.name").where($"films.actors" === "Henry Thomas")
当我在 spark shell 中运行它时,出现异常:
org.apache.spark.sql.AnalysisException:由于数据类型不匹配,无法解析 '(`films`.`actors` = 'Henry Thomas')':'(`films`.` 中的类型不同)actor` = 'Henry Thomas')' (array<array<string>> and string).;;'项目 [名称#128]+- '过滤器 (films#92.actors = Henry Thomas)+- 分析屏障+- 项目 [电影#92.name AS name#128,电影#92]+- 关系[导演#91,电影#92] json
我该如何正确地进行这样的查询?
有不同的选择吗?如果是,利弊是什么?
谢谢
编辑
@thebluephantom 这仍然不起作用.得到类似的异常.我认为这是因为我在另一个数组中有一个数组.这是个例外:
org.apache.spark.sql.AnalysisException:由于数据类型不匹配,无法解析array_contains(`films`.`actors`, 'Henry Thomas')':参数必须是一个数组,后跟一个值与数组成员类型相同;'过滤器 array_contains(films#7.actors, Henry Thomas)+- 分析屏障+- 项目 [导演#6,电影#7]+- 关系[导演#6,电影#7] json
尝试类似的方法,必须分解电影数据,这意味着重复的演员组很简单 - 否则我也无法让它工作 - 也许其他人可以:
使用 SPARK 2.3.1 更完整,如下所示:
val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/FileStore/tables/films.txt")val flattened = df.select($"director",explode($"films").as("films_flat"))flattened.select ("*").where (array_contains (flattened("films_flat.actors"), "Henry Thomas")).show(false)
返回:
+----------------+-----------------------------------+|导演 |films_flat |+----------------+-------------------------------------+|史蒂文·斯皮尔伯格|[[亨利·托马斯,德鲁·巴里摩尔],E.T]|+----------------+-------------------------------------+
I have a nested json structure loaded into a dataframe in spark. It contains several layers of arrays and I'm trying to figure out how to query this structure by values in the internal arrays.
Example: consider the following structure (directors.json file)
[
{
"director": "Steven Spielberg",
"films": [
{
"name": "E.T",
"actors": ["Henry Thomas", "Drew Barrymore"]
},
{
"name": "The Goonies",
"actors": ["Sean Astin", "Josh Brolin"]
}
]
},
{
"director": "Quentin Tarantino",
"films": [
{
"name": "Pulp Fiction",
"actors": ["John Travolta", "Samuel L. Jackson"]
},
{
"name": "Kill Bill: Vol. 1",
"actors": ["Uma Thurman", "Daryl Hannah"]
}
]
}
]
Lets say I want to run a query that will return all the films that a specific actor has participated in. something like this:
val directors = spark.read.json("directors.json")
directors.select($"films.name").where($"films.actors" === "Henry Thomas")
When I run this in the spark shell I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve '(`films`.`actors` = 'Henry Thomas')' due to data type mismatch: differing types in '(`films`.`actors` = 'Henry Thomas')' (array<array<string>> and string).;;
'Project [name#128]
+- 'Filter (films#92.actors = Henry Thomas)
+- AnalysisBarrier
+- Project [films#92.name AS name#128, films#92]
+- Relation[director#91,films#92] json
How do I properly make such a query?
Are there different alternatives? If So, what are the pros and cons?
Thanks
Edit
@thebluephantom this still doesn't work. getting similar exception. I think it's because I have an array within another array. This is the exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(`films`.`actors`, 'Henry Thomas')' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members;;
'Filter array_contains(films#7.actors, Henry Thomas)
+- AnalysisBarrier
+- Project [director#6, films#7]
+- Relation[director#6,films#7] json
Try something similar to this whereby the film data must be exploded which means the repeating group of actors is simple normalized - otherwise I cannot get it to work either - maybe someone else can:
More complete using SPARK 2.3.1 as follows with your data:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/films.txt")
val flattened = df.select($"director", explode($"films").as("films_flat"))
flattened.select ("*").where (array_contains (flattened("films_flat.actors"), "Henry Thomas")).show(false)
returns:
+----------------+-------------------------------------+
|director |films_flat |
+----------------+-------------------------------------+
|Steven Spielberg|[[Henry Thomas, Drew Barrymore], E.T]|
+----------------+-------------------------------------+
这篇关于如何基于相等性检查在 Spark 中使用内部数组查询嵌套 json的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!