AWS Glue: ETL job creates many empty output files
Question
I'm very new to this, so not sure if this script could be simplified/if I'm doing something wrong that's resulting in this happening. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info
The event_details table has 2 columns: event_id, event_details
The joined table schema should look like: event_id, event_name, platform, client_info, event_details
After I run this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table. (There are two records in the tables with the same event_id.) However, what I get is about 200 files of the form run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.:
- 198 files containing 0 bytes
- 2 files containing 250 bytes (each with the correct information corresponding to the enriched events)
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
Recommended Answer
The Spark SQL module contains the following default configuration:
spark.sql.shuffle.partitions is set to 200 by default.
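One option is to lower that default for the whole job before the join runs, so every shuffle produces fewer partitions. A sketch, assuming the spark session created at the top of the script above:

```python
# Lower the shuffle-partition count for all subsequent joins/aggregations
# in this job. Must be set before the join is executed.
spark.conf.set("spark.sql.shuffle.partitions", "2")
```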
That's why you are getting 200 files in the first place. You can check whether this is the case by running:
enriched_event_df.rdd.getNumPartitions()
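The mechanics can be illustrated without Spark at all: hash-partitioning a two-row join result across 200 shuffle partitions leaves 198 partitions, and therefore 198 output files, empty. A toy sketch (the event_id values are made up for illustration):

```python
# Toy illustration: distribute 2 joined records across Spark's default
# 200 shuffle partitions the way a hash shuffle would.
NUM_PARTITIONS = 200  # Spark's default for spark.sql.shuffle.partitions

records = [
    {"event_id": 101, "event_name": "signup"},
    {"event_id": 102, "event_name": "login"},
]

partitions = [[] for _ in range(NUM_PARTITIONS)]
for rec in records:
    # Each record lands in the partition chosen by hashing its join key
    partitions[hash(rec["event_id"]) % NUM_PARTITIONS].append(rec)

empty = sum(1 for p in partitions if not p)
print(f"{empty} of {NUM_PARTITIONS} partitions are empty")  # 198 of 200
```

Each partition is written out as one part file, empty or not, which matches the 198 zero-byte files observed in the question.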
If you get a value of 200, you can change the number of files to generate with the following code (note that repartition returns a new DataFrame, so the result must be assigned):
enriched_event_df = enriched_event_df.repartition(2)
The above code will create only two files with your data.
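Applied to the script in the question, the repartitioning has to happen before the DataFrame is converted back to a DynamicFrame and written out. A sketch of the corrected tail of the job, reusing the names defined above (coalesce(2) would also work and avoids a full shuffle):

```python
# Join, then shrink to 2 partitions so only 2 part files are written.
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event_df = enriched_event_df.repartition(2)  # or .coalesce(2)
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
datasink = glueContext.write_dynamic_frame.from_options(
    frame = enriched_event,
    connection_type = "s3",
    connection_options = {"path": output_dir},
    format = "json")
job.commit()
```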