Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Question

Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled) and noticed that Spark 'SaveAsTable' (Parquet format) writes to S3 were ~4x slower than to HDFS. We found a workaround in using the DirectParquetOutputCommitter [1] with Spark 1.6.
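For reference, the Spark 1.6 workaround was a single configuration switch. A minimal sketch, assuming an existing `sqlContext` and DataFrame `df` (the committer's package moved around during the 1.x line, so verify the fully qualified class name against your exact release; the table name is hypothetical):

```scala
// Spark 1.6.x: route Parquet writes through the direct committer, which
// writes straight to the final S3 location and skips the rename step.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

df.write.format("parquet").saveAsTable("my_table")  // table name hypothetical
```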

Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames the output afterwards, and rename operations on S3 are very expensive.
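To make the tax concrete: with the default FileOutputCommitter (algorithm v1), output is staged under a `_temporary` directory and renamed into place twice, at task commit and at job commit. An illustrative sketch of the flow (paths simplified, bucket name hypothetical):

```scala
// FileOutputCommitter v1 flow for a write like this, illustrated:
//
//   task attempt writes:  .../table/_temporary/0/_temporary/attempt_.../part-00000
//   task commit renames:  .../table/_temporary/0/task_.../part-00000
//   job  commit renames:  .../table/part-00000
//
// On HDFS each rename is a cheap metadata operation; S3 has no native
// rename, so every "rename" is a full object copy plus a delete.
df.write.parquet("s3://my-bucket/table")  // bucket name hypothetical
```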

We also understand the risk of using 'DirectParquetOutputCommitter': possible data corruption when speculative tasks are enabled.
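(Concretely, speculative execution launches duplicate attempts of slow tasks, and a direct committer points those duplicates at the same final destination; anyone keeping a direct-style committer had to leave speculation off. A minimal sketch:)

```scala
import org.apache.spark.SparkConf

// Direct-style committers are only safe when a task's output cannot be
// committed twice, so speculative execution must stay disabled.
val conf = new SparkConf().set("spark.speculation", "false")
```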

Now with Spark 2.0 this class has been deprecated, and we're wondering what options we have on the table so that we don't have to bear ~4x slower writes when we upgrade to Spark 2.0. Any thoughts/suggestions/recommendations would be highly appreciated.

One workaround we can think of is saving to HDFS and then copying to S3 via s3DistCp (any thoughts on how this can be done in a sane way, given that our Hive metastore points to S3?).
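A rough sketch of that workaround, with hypothetical paths, bucket, and table name (s3-dist-cp ships with EMR; whether ALTER TABLE ... SET LOCATION is the right metastore fix depends on how the table is defined):

```scala
import org.apache.spark.sql.SparkSession
import scala.sys.process._

val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()
val df = spark.range(1000).toDF("id")  // stand-in for the real data

// 1) Write to HDFS first, where the commit-time renames are cheap.
df.write.parquet("hdfs:///tmp/staging/my_table")  // hypothetical path

// 2) Bulk-copy the committed files to S3 once the Spark write finishes.
Seq("s3-dist-cp",
    "--src",  "hdfs:///tmp/staging/my_table",
    "--dest", "s3://my-bucket/warehouse/my_table").!  // hypothetical bucket

// 3) Point the metastore table at the S3 data (table name hypothetical).
spark.sql("ALTER TABLE my_table SET LOCATION 's3://my-bucket/warehouse/my_table'")
```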

Netflix also seems to have fixed this [3]; any idea when they're planning to open-source it?

Thanks.

[1], [2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s , http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-netflix-big-data-platform

Answer

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Since you are on EMR, just use s3 (no need for s3a).
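Putting the answer together, a minimal Spark 2.0 sketch (bucket and path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-to-s3").getOrCreate()

// Algorithm v2 moves task output into the final directory at task commit,
// eliminating the expensive serial renames of the job-commit phase.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// On EMR the s3:// scheme is backed by EMRFS, so s3a:// is unnecessary.
spark.range(1000).toDF("id")
  .write.parquet("s3://my-bucket/tables/example")  // placeholder path
```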

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as to HDFS).

If you want to read more, check out this JIRA ticket: SPARK-10063.
