SparkSQL sql syntax for nth item in array


Question

I have a json object that has an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with spark sql.

Here is an example object:

{
  "stuff": [
    {"a": 1, "b": 2, "c": 3}
  ]
}

So, in javascript, to get the value for c, I'd write myData.stuff[0].c.

And in my spark sql query, if that array weren't there, I'd be able to use dot notation:

SELECT stuff.c FROM blah
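
For context, here is a minimal sketch of that flat case (assuming a sqlContext and sc as in the answer below), where dot notation works because there is no array wrapper:

# Hypothetical flat variant of the data: the struct is not wrapped
# in an array, so plain dot notation resolves the nested field.
df = sqlContext.read.json(sc.parallelize(['{"stuff": {"a": 1, "b": 2, "c": 3}}']))
df.registerTempTable("blah")

sqlContext.sql("SELECT stuff.c FROM blah").show()
## +---+
## |  c|
## +---+
## |  3|
## +---+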

but I can't, because the innermost object is wrapped in an array.

I have tried:

SELECT stuff.0.c FROM blah    -- FAIL
SELECT stuff.[0].c FROM blah  -- FAIL

So, what is the magical way to select that data? Or is that even supported yet?

Answer

It is not clear what you mean by JSON object, so let's consider two different cases:

  1. An array of structs

import tempfile

# Write a one-line JSON file and let Spark infer the schema from it.
path = tempfile.mktemp()
with open(path, "w") as fw:
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")

df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- a: long (nullable = true)
##  |    |    |-- b: long (nullable = true)
##  |    |    |-- c: long (nullable = true)

sqlContext.sql("SELECT stuff[0].a FROM df").show()

## +---+
## |_c0|
## +---+
## |  1|
## +---+
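
The same lookup can also be written with the DataFrame API instead of raw SQL; a minimal sketch against the same df (getItem and getField are standard Column methods):

from pyspark.sql.functions import col

# Index into the array column, then pull the struct field out by name.
df.select(col("stuff").getItem(0).getField("c").alias("c")).show()

## +---+
## |  c|
## +---+
## |  3|
## +---+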

  2. An array of maps

    # Note: schema inference from dictionaries has been deprecated
    # don't use this in practice
    df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: map (containsNull = true)
    ##  |    |    |-- key: string
    ##  |    |    |-- value: long (valueContainsNull = true)
    
    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
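
If you need the map variant without relying on the deprecated dictionary inference, one option is an explicit schema; a sketch, assuming the same sqlContext (the schema and sample row here are illustrative):

    from pyspark.sql.types import (ArrayType, LongType, MapType,
                                   StringType, StructField, StructType)

    # Declare stuff as an array of string -> long maps up front,
    # rather than letting Spark infer the type from Python dicts.
    schema = StructType([
        StructField("stuff", ArrayType(MapType(StringType(), LongType())))
    ])
    df = sqlContext.createDataFrame([([{"a": 1, "b": 2, "c": 3}],)], schema)
    df.registerTempTable("df")

    sqlContext.sql("SELECT stuff[0]['c'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  3|
    ## +---+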
    

See also: Querying Spark SQL DataFrame with complex types
