How to optimize shuffle spill in an Apache Spark application


Problem description


I am running a Spark streaming application with 2 workers. The application has a join and a union operation.


All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (spill memory is more than 20 times larger).


Please find the Spark stage details in the image below:

After researching this, I found:


Shuffle spill happens when there is not sufficient memory for shuffle data.


Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling


Shuffle spill (disk) - size of the serialized form of the data on disk after spilling


Since deserialized data occupies more space than serialized data, shuffle spill (memory) is larger.
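The size gap between the two metrics can be seen with a plain-Python analogy (this uses CPython object sizes and `pickle`, not Spark's serializer, so the exact numbers are only illustrative of the effect):

```python
import pickle
import sys

# An in-memory list of Python ints: the "deserialized" form.
data = list(range(1000))

# Rough in-memory footprint: the list's own overhead plus each int object.
in_memory_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# The "serialized" form: a compact byte stream, as would be written on spill.
serialized_bytes = len(pickle.dumps(data))

print(f"deserialized: ~{in_memory_bytes} bytes")
print(f"serialized:    {serialized_bytes} bytes")
# The deserialized representation is several times larger, which is why
# "Shuffle spill (memory)" dwarfs "Shuffle spill (disk)" for the same data.
```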

I noticed that this spill memory size is huge for large input data.

My queries are:


Does this spilling impact performance considerably?


How can I optimize this spilling, in both memory and disk?


Are there any Spark properties that can reduce/control this huge spilling?

Recommended answer


Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources, including this video. Spark 1.4 has some better diagnostics and visualisation in the interface, which can help you.


In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.
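Under Spark 1.x legacy memory management (the model this answer assumes), the shuffle buffer can be estimated with simple arithmetic. The figures below are made-up example values, not defaults from any real cluster, and the small additional safety fraction Spark applies on top is ignored for simplicity:

```python
# Illustrative sizing for the Spark 1.x shuffle buffer (example values only).
executor_memory_gb = 4.0   # spark.executor.memory
shuffle_fraction = 0.2     # spark.shuffle.memoryFraction (the 1.x default)
cores = 4                  # concurrent tasks per executor

shuffle_buffer_gb = executor_memory_gb * shuffle_fraction
per_task_gb = shuffle_buffer_gb / cores

print(f"shuffle buffer: {shuffle_buffer_gb:.2f} GB total, "
      f"{per_task_gb:.2f} GB per concurrent task")
# With these numbers, a partition whose shuffle data exceeds roughly 0.2 GB
# would spill to disk, which motivates the four remedies listed below.
```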

You can:



  1. Manually repartition() your prior stage so that you have smaller partitions from input.
  2. Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory).
  3. Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
  4. Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory.
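As a sketch, options 2-4 could be wired up in a spark-submit invocation like the following. The values and the application file name `my_streaming_app.py` are illustrative, not recommendations; the memoryFraction properties are the legacy Spark 1.x settings named above:

```shell
# Illustrative values only, not tuned recommendations.
# - spark.executor.memory: more executor memory -> larger shuffle buffer (2)
# - shuffle/storage memoryFraction: shift memory from caching to shuffle,
#   keeping the two fractions' total the same (3)
# - spark.executor.cores: fewer concurrent tasks -> more buffer per task (4);
#   on a standalone cluster SPARK_WORKER_CORES plays a similar role
spark-submit \
  --conf spark.executor.memory=6g \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.storage.memoryFraction=0.4 \
  --conf spark.executor.cores=2 \
  my_streaming_app.py
```

Option 1 lives in application code instead, e.g. calling `rdd.repartition(n)` before the shuffle-heavy stage.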


If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
