Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Question

Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled) and noticed that Spark 'SaveAsTable' (Parquet format) writes to S3 were ~4x slower than to HDFS. We found a workaround in using the DirectParquetOutputCommitter [1] with Spark 1.6.
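For reference, the Spark 1.6 workaround was a single configuration switch. A minimal sketch, assuming an existing `sqlContext` and DataFrame `df` (the committer's package moved around during the 1.x line, so verify the fully qualified class name against your exact release; the table name is hypothetical):

```scala
// Spark 1.6.x: route Parquet writes through the direct committer, which
// writes straight to the final S3 location and skips the rename step.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

df.write.format("parquet").saveAsTable("my_table")  // table name hypothetical
```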

Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames the output afterwards, and rename operations on S3 are very expensive.
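To make the tax concrete: with the default FileOutputCommitter (algorithm v1), output is staged under a `_temporary` directory and renamed into place twice, at task commit and at job commit. An illustrative sketch of the flow (paths simplified, bucket name hypothetical):

```scala
// FileOutputCommitter v1 flow for a write like this, illustrated:
//
//   task attempt writes:  .../table/_temporary/0/_temporary/attempt_.../part-00000
//   task commit renames:  .../table/_temporary/0/task_.../part-00000
//   job  commit renames:  .../table/part-00000
//
// On HDFS each rename is a cheap metadata operation; S3 has no native
// rename, so every "rename" is a full object copy plus a delete.
df.write.parquet("s3://my-bucket/table")  // bucket name hypothetical
```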

We also understand the risk of using 'DirectParquetOutputCommitter': possible data corruption when speculative tasks are enabled.
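(Concretely, speculative execution launches duplicate attempts of slow tasks, and a direct committer points those duplicates at the same final destination; anyone keeping a direct-style committer had to leave speculation off. A minimal sketch:)

```scala
import org.apache.spark.SparkConf

// Direct-style committers are only safe when a task's output cannot be
// committed twice, so speculative execution must stay disabled.
val conf = new SparkConf().set("spark.speculation", "false")
```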

Now with Spark 2.0 this class has been deprecated, and we're wondering what options we have on the table so that we don't have to bear ~4x slower writes when we upgrade to Spark 2.0. Any thoughts/suggestions/recommendations would be highly appreciated.

One workaround we can think of is saving to HDFS and then copying to S3 via s3DistCp (any thoughts on how this can be done in a sane way, given that our Hive metastore points to S3?).
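A rough sketch of that workaround, with hypothetical paths, bucket, and table name (s3-dist-cp ships with EMR; whether ALTER TABLE ... SET LOCATION is the right metastore fix depends on how the table is defined):

```scala
import org.apache.spark.sql.SparkSession
import scala.sys.process._

val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()
val df = spark.range(1000).toDF("id")  // stand-in for the real data

// 1) Write to HDFS first, where the commit-time renames are cheap.
df.write.parquet("hdfs:///tmp/staging/my_table")  // hypothetical path

// 2) Bulk-copy the committed files to S3 once the Spark write finishes.
Seq("s3-dist-cp",
    "--src",  "hdfs:///tmp/staging/my_table",
    "--dest", "s3://my-bucket/warehouse/my_table").!  // hypothetical bucket

// 3) Point the metastore table at the S3 data (table name hypothetical).
spark.sql("ALTER TABLE my_table SET LOCATION 's3://my-bucket/warehouse/my_table'")
```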

Netflix also seems to have fixed this [3]; any idea when they're planning to open-source it?

Thanks.

[1], [2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s , http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-netflix-big-data-platform

Answer

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Since you are on EMR, just use s3 (no need for s3a).
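Putting the answer together, a minimal Spark 2.0 sketch (bucket and path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-to-s3").getOrCreate()

// Algorithm v2 moves task output into the final directory at task commit,
// eliminating the expensive serial renames of the job-commit phase.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// On EMR the s3:// scheme is backed by EMRFS, so s3a:// is unnecessary.
spark.range(1000).toDF("id")
  .write.parquet("s3://my-bucket/tables/example")  // placeholder path
```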

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as to HDFS).

If you want to read more, check out this JIRA ticket: SPARK-10063.
