Extremely slow S3 write times from EMR/Spark


Question

I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR.

My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours.

I was curious what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file. Then, looking directly at S3, I saw that all my files were in a _temporary directory.

Secondly, I'm concerned about my cluster cost. It appears I only need 2 hours of compute for this specific task, but I end up buying up to 5 hours. I'm curious whether EMR AutoScaling can help with cost in this situation.

Some articles discuss changing the file output committer algorithm, but I've had little success with that.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Writing to the local HDFS is quick. I'm curious whether issuing a hadoop command to copy the data to S3 would be faster?

Answer

What you are seeing is a problem with the output committer and S3. The job commit applies fs.rename to the _temporary folder, and since S3 does not support rename, a single request ends up copying and deleting all the files from _temporary to their final destination.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") only works with Hadoop 2.7 and later. What it does is copy each file out of _temporary during task commit rather than job commit, so the work is distributed across executors and runs pretty fast.
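
For reference, here is a minimal sketch of one way to apply this on Spark 2.x, setting the property through the SparkSession builder instead of on an existing SparkContext (the app name and output path are placeholders, not from the question):

// Sketch: request the v2 FileOutputCommitter algorithm at session creation.
// The spark.hadoop.* prefix forwards the property into the Hadoop configuration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-write-example")   // placeholder app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// "s3://my-bucket/output/" is a placeholder path, not taken from the question.
spark.range(1000).write.mode("overwrite").parquet("s3://my-bucket/output/")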

If you use an older version of Hadoop, I would use Spark 1.6 and use:

sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

*Note that it does not work with speculation turned on or when writing in append mode.

**Also note that it is deprecated in Spark 2.0 (replaced by algorithm.version=2).
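
As a rough illustration of how the Spark 1.6 approach fits together (the input and output paths are placeholders, the committer class name is taken as given above, and speculation is assumed to be left at its default of off):

// Sketch for Spark 1.6: use the direct Parquet committer so files go straight to S3.
// Note: do not enable spark.speculation and do not write in append mode with this committer.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// Placeholder input and output paths, not from the original answer.
val df = sqlContext.read.parquet("hdfs:///input/")
df.write.mode("overwrite").parquet("s3://my-bucket/output/")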

BTW, in my team we actually write with Spark to HDFS and use DistCp jobs (specifically s3-dist-cp) in production to copy the files to S3, but this is done for several other reasons (consistency, fault tolerance), so it is not necessary; you can write to S3 pretty fast using what I suggested.
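
A minimal sketch of that two-step pattern (the paths, bucket name, and s3-dist-cp invocation below are illustrative assumptions, not from the original answer):

// Step 1 (Spark): write the job output to local HDFS on the cluster, which is fast.
// "myOutputDf" stands for whatever DataFrame the job produces; paths are placeholders.
myOutputDf.write.mode("overwrite").parquet("hdfs:///tmp/job-output/")

// Step 2 (EMR step or shell on the master node): copy the result to S3 with s3-dist-cp, e.g.
//   s3-dist-cp --src hdfs:///tmp/job-output/ --dest s3://my-bucket/job-output/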
