How to optimize shuffle spill in an Apache Spark application


Problem description


I am running a Spark streaming application with 2 workers. The application has a join and a union operation.


All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (spill memory is more than 20 times larger).


Please find the Spark stage details in the image below:

After researching this, I found:


Shuffle spill happens when there is not sufficient memory for shuffle data.


Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling


Shuffle spill (disk) - size of the serialized form of the data on disk after spilling


Since deserialized data occupies more space than serialized data, shuffle spill (memory) is larger.
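The size gap between the two metrics can be seen with a plain-Python analogy (this uses CPython object sizes and `pickle`, not Spark's serializer, so the exact numbers are only illustrative of the effect):

```python
import pickle
import sys

# An in-memory list of Python ints: the "deserialized" form.
data = list(range(1000))

# Rough in-memory footprint: the list's own overhead plus each int object.
in_memory_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# The "serialized" form: a compact byte stream, as would be written on spill.
serialized_bytes = len(pickle.dumps(data))

print(f"deserialized: ~{in_memory_bytes} bytes")
print(f"serialized:    {serialized_bytes} bytes")
# The deserialized representation is several times larger, which is why
# "Shuffle spill (memory)" dwarfs "Shuffle spill (disk)" for the same data.
```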

I noticed that this spill memory size is huge for large input data.

My queries are:


Does this spilling impact performance considerably?


How can I optimize this spilling, in both memory and disk?


Are there any Spark properties that can reduce/control this huge spilling?

Recommended answer


Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources, including this video. Spark 1.4 has some better diagnostics and visualisation in the interface, which can help you.


In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.
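Under Spark 1.x legacy memory management (the model this answer assumes), the shuffle buffer can be estimated with simple arithmetic. The figures below are made-up example values, not defaults from any real cluster, and the small additional safety fraction Spark applies on top is ignored for simplicity:

```python
# Illustrative sizing for the Spark 1.x shuffle buffer (example values only).
executor_memory_gb = 4.0   # spark.executor.memory
shuffle_fraction = 0.2     # spark.shuffle.memoryFraction (the 1.x default)
cores = 4                  # concurrent tasks per executor

shuffle_buffer_gb = executor_memory_gb * shuffle_fraction
per_task_gb = shuffle_buffer_gb / cores

print(f"shuffle buffer: {shuffle_buffer_gb:.2f} GB total, "
      f"{per_task_gb:.2f} GB per concurrent task")
# With these numbers, a partition whose shuffle data exceeds roughly 0.2 GB
# would spill to disk, which motivates the four remedies listed below.
```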

You can:



  1. Manually repartition() your prior stage so that you have smaller partitions from input.
  2. Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory).
  3. Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
  4. Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory.
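As a sketch, options 2-4 could be wired up in a spark-submit invocation like the following. The values and the application file name `my_streaming_app.py` are illustrative, not recommendations; the memoryFraction properties are the legacy Spark 1.x settings named above:

```shell
# Illustrative values only, not tuned recommendations.
# - spark.executor.memory: more executor memory -> larger shuffle buffer (2)
# - shuffle/storage memoryFraction: shift memory from caching to shuffle,
#   keeping the two fractions' total the same (3)
# - spark.executor.cores: fewer concurrent tasks -> more buffer per task (4);
#   on a standalone cluster SPARK_WORKER_CORES plays a similar role
spark-submit \
  --conf spark.executor.memory=6g \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.storage.memoryFraction=0.4 \
  --conf spark.executor.cores=2 \
  my_streaming_app.py
```

Option 1 lives in application code instead, e.g. calling `rdd.repartition(n)` before the shuffle-heavy stage.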


If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
