AWS Glue takes a long time to finish


Question

I am running a very simple job, as follows:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the source table from the Glue Data Catalog.
l_table = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog",
    table_name="fctable")

# Drop unneeded fields and rename tbl_code to table_code.
l_table = l_table.drop_fields(['seq', 'partition_0', 'partition_1', 'partition_2', 'partition_3']).rename_field('tbl_code', 'table_code')
print("Count:", l_table.count())
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()

# Flatten the nested "table" array into a collection of relational frames.
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print("Before keys() call")
dfc.keys()
print("After keys() call")
l_table.select_fields(['table']).printSchema()
dfc.select('table_root_table').toDF().where("id = 1 or id = 2").orderBy(['id', 'index']).show()
dfc.select('table_root').toDF().where("table = 1 or table = 2").show()

The data structure is also simple:

root
|-- table: array
| |-- element: struct
| | |-- trans_time: string
| | |-- seq: null
| | |-- operation: string
| | |-- order_date: string
| | |-- order_code: string
| | |-- tbl_code: string
| | |-- ship_plant_code: string
|-- partition_0
|-- partition_1
|-- partition_2
|-- partition_3

When I test-run the job, it takes anywhere from 12 to 16 minutes to finish. But the CloudWatch log shows that the job took only 2 seconds to display all my data.

So my question is: where does the AWS Glue job spend its time beyond what the log shows, and what is it doing outside the logged period?
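One way to narrow this down is to time each step inside the script and compare the sum with the total job duration. The following is only a minimal sketch using the Python standard library; the `timed` helper and the `timings` dictionary are hypothetical names introduced here, not part of the Glue API.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: collects wall-clock time per labelled step.
timings = {}

@contextmanager
def timed(label, sink=timings):
    """Record how long the wrapped block takes and print it to the job log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start
        print("%s took %.2f s" % (label, sink[label]))

# Stand-in for a real step such as l_table.count() or relationalize(...).
with timed("example step"):
    time.sleep(0.01)

# Total time spent inside the script's own steps.
print("script total: %.2f s" % sum(timings.values()))
```

In the job above, each call (`count()`, `relationalize(...)`, the `show()` calls) could be wrapped in its own `with timed(...)` block. If the script total stays at a few seconds while the job run lasts many minutes, the remaining time is being spent outside the script.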

Recommended answer

The time is spent setting up the environment that your code runs in. I had the same issue and contacted the AWS Glue team, who were helpful. Glue builds an environment when you run the first job, and that environment stays alive for one hour. If you run the same script again, or any other script, within that hour, the next job takes significantly less time. They call the first run a "cold start". My first job took 17 minutes; I ran the same job again right after it finished and it took only 3 minutes.
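To see how much of a run went to setup rather than to the script, the job run metadata can be compared against the reported execution time. This is a rough sketch, assuming the `JobRun` record returned by boto3's `get_job_run` carries `StartedOn`, `CompletedOn`, and `ExecutionTime`, and assuming `ExecutionTime` does not include environment provisioning (this may differ between Glue versions). The function names and the sample values are hypothetical illustrations, not measurements.

```python
from datetime import datetime, timedelta

def startup_overhead_seconds(job_run):
    """Wall-clock duration minus reported execution time, in seconds.

    job_run is a dict shaped like the 'JobRun' entry of a Glue
    GetJobRun response. Assumption: ExecutionTime excludes setup time.
    """
    wall = (job_run["CompletedOn"] - job_run["StartedOn"]).total_seconds()
    return wall - job_run["ExecutionTime"]

def fetch_job_run(job_name, run_id):
    """Fetch a real job run; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the arithmetic above runs without it
    glue = boto3.client("glue")
    return glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]

# Hypothetical values for illustration only: a 15-minute run
# whose reported execution time is 120 seconds.
started = datetime(2018, 1, 1, 12, 0, 0)
example_run = {
    "StartedOn": started,
    "CompletedOn": started + timedelta(minutes=15),
    "ExecutionTime": 120,
}
print(startup_overhead_seconds(example_run))  # 780.0
```

Comparing this figure for a first run and for a second run started shortly afterwards would show whether the difference matches the cold start behaviour described above.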
