How to convert json files stored in s3 to csv using glue?


Problem description

I have some json files stored in s3, and I need to convert them to csv format, in the same folder where they are.

Currently I'm using Glue to map them to Athena, but, as I said, now I need to convert them to csv.

Is it possible to use a Glue job to do that?

I'm trying to understand whether a Glue job can crawl through my s3 folder directories, converting every json file it finds to csv (as new files).

If that is not possible, is there any AWS service that could help me do that?

Here's the current code I'm trying to run:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the json files directly under the dealer-data prefix into a DynamicFrame
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]}, format = "json")

# Write the same data back to the same prefix in CSV format
outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data"}, format = "csv")

The job runs with no errors, but nothing seems to happen in the s3 folder. I'm assuming the code will pick up the json files from /dealer-data and write them back to the same folder as csv. I'm probably wrong.
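For what it's worth, here is a minimal sketch of the same job with the output sent to a separate prefix (the dealer-data-csv name is just a hypothetical example), so that any CSV part-files Glue produces are easy to spot instead of being mixed in with the source json:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the json files under the source prefix into a DynamicFrame
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]}, format = "json")

# Write the data as CSV to a different (hypothetical) output prefix rather than back into the source folder
glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data-csv"}, format = "csv")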

Ok, I almost made it work the way I needed.

The thing is, create_dynamic_frame is only working for folders that contain files directly, not for folders whose files sit in subfolders.

import sys
import logging
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the json files from one specific leaf folder (no subfolders)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"]}, format = "json")

# Write the converted data as CSV; Glue treats this path as an output prefix and writes part-files under it
outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2/bla.csv"}, format = "csv")

The above works, but only for that one directory (../2). Is there a way to read all files given a folder and its subfolders?
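A sketch of one way this could work, assuming the S3 source honors the documented "recurse" connection option (which, as far as I know, tells Glue to also read files in subdirectories of the listed paths); the output prefix here is again hypothetical:

# Read json files from the folder and all of its subfolders by setting "recurse": True
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data/installations"], "recurse": True}, format = "json")

# Write everything out as CSV under a single (hypothetical) output prefix
glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data-csv/installations"}, format = "csv")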

Recommended answer
