Reading pretty print json files in Apache Spark


Problem Description

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed. Each JSON file holds a single massive dictionary, but it is not on one line. According to this thread, each dictionary in a JSON file should sit on a single line, which is a limitation of Apache Spark. My files are not structured that way.

My JSON schema looks like this -

{
    "dataset": [
        {
            "key1": [
                {
                    "range": "range1",
                    "value": 0.0
                },
                {
                    "range": "range2",
                    "value": 0.23
                }
            ]
        }, {..}, {..}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions -

  1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

  2. If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket, and the bucket is partitioned by day.

  3. Is there any other tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any other suggested tool with a Zeppelin notebook.

Answer

You could use sc.wholeTextFiles(). Here is a related post.
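For instance, here is a minimal sketch of the wholeTextFiles() approach, assuming a Spark 2.x SparkSession named spark; the bucket path is a placeholder, and the "dataset" key comes from the schema above:

import json

# Read each file as a single (path, content) pair, so pretty-printing
# doesn't matter, then flatten the "dataset" array into one record per row.
rdd = sc.wholeTextFiles("s3://my-bucket/2016/09/08/*.json")  # placeholder path
records = rdd.flatMap(lambda kv: json.loads(kv[1])["dataset"])

# Re-serialize each record and let Spark infer the schema.
df = spark.read.json(records.map(json.dumps))
df.show()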

Alternatively, you could reformat your JSON using a simple function and load the generated file.

import json

def reformat_json(input_path, output_path):
    # Load the entire pretty-printed file into memory.
    with open(input_path, 'r') as handle:
        doc = json.load(handle)

    # For the schema above, the records live under the "dataset" key;
    # write each one as a compact, single-line JSON object.
    with open(output_path, 'w') as out:
        for entry in doc["dataset"]:
            out.write(json.dumps(entry) + "\n")
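The generated file is newline-delimited, so Spark can then read it directly; for example (the output path is hypothetical):

df = spark.read.json("s3://my-bucket/reformatted/2016-09-08.json")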
