Using AWS Glue to convert very big csv.gz (30-40 GB each) to parquet


Problem Description

There are lots of questions like this, but nothing seems to help. I am trying to convert quite large csv.gz files to Parquet and keep getting various errors such as

'Command failed with exit code 1'

An error occurred while calling o392.pyWriteDynamicFrame. Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-5-241.eu-central-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed

In the metrics monitoring I don't see much CPU or memory load. There is ETL data movement, but that shouldn't trigger any errors when working with S3.

Another problem is that such a job runs for 4-5 hours before failing. Is that expected behavior? The CSV files have around 30-40 columns.

I don't know which direction to go. Can Glue handle such large files at all?

Recommended Answer

I don't think the problem is directly connected to the number of DPUs. Your files are large and you are using the GZIP format, which is not splittable, and that is why you are running into this problem.

I suggest converting your files from GZIP to bzip2 or LZ4. Additionally, you should consider partitioning the output data for better performance in the future.
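One way to act on the re-compression suggestion is a small plain-PySpark job that rewrites the CSVs with a splittable codec before the Glue conversion runs. This is a minimal sketch, not code from the answer; the bucket paths and the repartition count are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("recompress-csv").getOrCreate()

    # A single .csv.gz file is read by one task, because gzip is not splittable.
    df = spark.read.option("header", "true").csv("s3://my-bucket/raw/")

    # bzip2 output is splittable, so a downstream Glue job can read it with
    # many parallel tasks instead of one long-running task per file.
    (df.repartition(200)                      # spread rows across many output files
       .write
       .option("header", "true")
       .option("compression", "bzip2")
       .csv("s3://my-bucket/staged-bzip2/"))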

http://comphadoop.weebly.com/
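To illustrate the output-partitioning suggestion, here is a minimal Glue ETL sketch. It assumes the staged CSVs have been crawled into a Data Catalog database called "mydb" as table "staged_csv" and that the data has a "year" column to partition by; all of those names are assumptions, not details from the question.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the staged (splittable) CSV data through the Glue Data Catalog.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="mydb",
        table_name="staged_csv",
    )

    # Write partitioned Parquet; each partition key becomes an S3 prefix
    # such as .../year=2019/, which lets later queries prune data.
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={
            "path": "s3://my-bucket/parquet/",
            "partitionKeys": ["year"],
        },
        format="parquet",
    )

    job.commit()

Parquet written this way is compressed with Snappy by default, so the non-splittable-gzip issue does not reappear on the output side.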
