SparkSQL SQL syntax for nth item in array


Question

I have a json object that has an unfortunate combination of nesting and arrays. So its not totally obvious how to query it with spark sql.

Here's an example object:

{
  stuff: [
    {a:1,b:2,c:3}
  ]
}

so, in javascript, to get the value for c, I'd write myData.stuff[0].c
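For comparison, once the JSON is parsed, the same path in plain Python is just chained indexing (an illustrative aside, not Spark code):

```python
import json

raw = '{"stuff": [{"a": 1, "b": 2, "c": 3}]}'
data = json.loads(raw)
# Index the array first, then the nested object, mirroring myData.stuff[0].c.
c = data["stuff"][0]["c"]
print(c)  # prints 3
```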

And in my spark sql query, if that array wasn't there, I'd be able to use dot notation:

SELECT stuff.c FROM blah

but I can't, because the innermost object is wrapped in an array.

I've tried:

SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL

So, what is the magical way to select that data? or is that even supported yet?

Answer

It is not clear what you mean by "JSON object", so let's consider two different cases:

  1. An array of structs

import tempfile    

path = tempfile.mktemp()
with open(path, "w") as fw: 
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")

df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- a: long (nullable = true)
##  |    |    |-- b: long (nullable = true)
##  |    |    |-- c: long (nullable = true)

sqlContext.sql("SELECT stuff[0].a FROM df").show()

## +---+
## |_c0|
## +---+
## |  1|
## +---+

  2. An array of maps

    # Note: schema inference from dictionaries has been deprecated
    # don't use this in practice
    df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: map (containsNull = true)
    ##  |    |    |-- key: string
    ##  |    |    |-- value: long (valueContainsNull = true)
    
    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
    

  • See also Querying Spark SQL DataFrame with complex types
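If the goal is every element rather than only the nth, Spark SQL's LATERAL VIEW explode turns an array column into one row per element. A plain-Python sketch of that flattening (the two-record input here is my own illustrative addition):

```python
import json

raw = '{"stuff": [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]}'
rows = json.loads(raw)["stuff"]
# Flattening the array yields one "row" per element, so c is reachable
# per row -- the same effect explode() has on an array column in Spark.
cs = [row["c"] for row in rows]
print(cs)  # prints [3, 6]
```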
