SparkSQL SQL syntax for nth item in array


Question

I have a json object that has an unfortunate combination of nesting and arrays. So its not totally obvious how to query it with spark sql.

Here's an example object:

{
  stuff: [
    {a:1,b:2,c:3}
  ]
}

so, in javascript, to get the value for c, I'd write myData.stuff[0].c
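For comparison, once the JSON is parsed, the same path in plain Python is just chained indexing (an illustrative aside, not Spark code):

```python
import json

raw = '{"stuff": [{"a": 1, "b": 2, "c": 3}]}'
data = json.loads(raw)
# Index the array first, then the nested object, mirroring myData.stuff[0].c.
c = data["stuff"][0]["c"]
print(c)  # prints 3
```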

And in my spark sql query, if that array wasn't there, I'd be able to use dot notation:

SELECT stuff.c FROM blah

but I can't, because the innermost object is wrapped in an array.

I've tried:

SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL

So, what is the magical way to select that data? or is that even supported yet?

Answer

It is not clear what you mean by "JSON object", so let's consider two different cases:

  1. An array of structs

import tempfile    

path = tempfile.mktemp()
with open(path, "w") as fw: 
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")

df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- a: long (nullable = true)
##  |    |    |-- b: long (nullable = true)
##  |    |    |-- c: long (nullable = true)

sqlContext.sql("SELECT stuff[0].a FROM df").show()

## +---+
## |_c0|
## +---+
## |  1|
## +---+

  2. An array of maps

    # Note: schema inference from dictionaries has been deprecated
    # don't use this in practice
    df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: map (containsNull = true)
    ##  |    |    |-- key: string
    ##  |    |    |-- value: long (valueContainsNull = true)
    
    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
    

  • See also Querying Spark SQL DataFrame with complex types
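If the goal is every element rather than only the nth, Spark SQL's LATERAL VIEW explode turns an array column into one row per element. A plain-Python sketch of that flattening (the two-record input here is my own illustrative addition):

```python
import json

raw = '{"stuff": [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]}'
rows = json.loads(raw)["stuff"]
# Flattening the array yields one "row" per element, so c is reachable
# per row -- the same effect explode() has on an array column in Spark.
cs = [row["c"] for row in rows]
print(cs)  # prints [3, 6]
```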
