Extremely slow S3 write times from EMR/Spark


Question

I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR.

My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours.

I was curious what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file. Then, looking directly at S3, I see all my files are in a _temporary directory.

Secondly, I'm concerned about my cluster cost. It appears I only need to buy 2 hours of compute for this specific task; however, I end up buying up to 5 hours. I'm curious whether EMR AutoScaling can help with cost in this situation.

Some articles discuss changing the file output committer algorithm, but I've had little success with that:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Writing to the local HDFS is quick. I'm curious whether issuing a hadoop command to copy the data to S3 would be faster?

Answer

What you are seeing is a problem with the output committer and S3. The commit-job phase applies fs.rename on the _temporary folder, and since S3 does not support rename, a single request ends up copying and deleting all the files from _temporary to their final destination.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") only works with Hadoop 2.7 and later. What it does is move each file out of _temporary during the commit-task phase rather than the commit-job phase, so the work is distributed across tasks and runs much faster.
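For illustration, a minimal sketch of applying this setting before a DataFrame write (the DataFrame df and the bucket path are placeholders, not from the original question; the same value can also be passed at submit time via --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2):

// Use the v2 FileOutputCommitter algorithm (Hadoop 2.7+): files are moved out of
// _temporary as each task commits, instead of in a single serial commit-job step.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

df.write
  .mode("overwrite")
  .parquet("s3://my-bucket/output/")   // hypothetical bucket path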

If you are on an older version of Hadoop, I would use Spark 1.6 with:

sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

*Note that it does not work with speculation turned on or when writing in append mode.

**Also note that it is deprecated in Spark 2.0 (replaced by algorithm.version=2).
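Put together, a sketch of that Spark 1.6 variant under those constraints (df and the output path are placeholders):

// Spark 1.6 only: write Parquet through the direct committer so files land in
// their final S3 location without a rename out of _temporary.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

df.write
  .mode("overwrite")                    // append mode is not supported here
  .parquet("s3://my-bucket/output/")    // speculation must stay disabled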

BTW, in my team we actually write with Spark to HDFS and use DistCp jobs (specifically s3-dist-cp) in production to copy the files to S3, but this is done for several other reasons (consistency, fault tolerance), so it is not strictly necessary; you can write to S3 pretty fast using what I suggested above.
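For reference, a rough sketch of that two-step approach (paths are placeholders; the copy runs as a separate EMR step or shell command rather than from inside Spark):

// Step 1: write to the cluster's HDFS, where the committer's rename is cheap.
df.write.parquet("hdfs:///tmp/job-output/")

// Step 2, outside Spark: copy the finished files to S3, e.g.
//   s3-dist-cp --src hdfs:///tmp/job-output/ --dest s3://my-bucket/output/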
