Spark Task Memory Allocation


Problem description

I am trying to find out the best way to configure memory on the nodes of my cluster. However, I believe that to do that there are some things I need to understand better, such as how Spark handles memory across tasks.

For example, let's say I have 3 executors, and each executor can run up to 8 tasks in parallel (i.e. it has 8 cores). If I have an RDD with 24 partitions, then in theory all partitions can be processed in parallel. However, if we zoom in on one executor, this assumes that each task can hold its partition in memory while operating on it. If not, it means that 8 tasks will not actually run in parallel and some scheduling will be required.

Hence I concluded that if one seeks true parallelism, having some idea of the size of your partitions is helpful, because it tells you how to size your executors to achieve it.
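For concreteness, here is a minimal sketch of how that scenario could be expressed as a Spark configuration. The memory value and application name are placeholders rather than recommendations, and spark.executor.instances is only honored by cluster managers such as YARN or Kubernetes:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical setup matching the scenario above: 3 executors with 8 cores each,
    // so up to 24 tasks (one per partition of the 24-partition RDD) could run at once.
    // The master URL is expected to be supplied by spark-submit.
    val conf = new SparkConf()
      .setAppName("memory-allocation-example")   // placeholder application name
      .set("spark.executor.instances", "3")      // honored on YARN/Kubernetes
      .set("spark.executor.cores", "8")          // concurrent tasks per executor
      .set("spark.executor.memory", "8g")        // placeholder heap size per executor
    val sc = new SparkContext(conf)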

Q0 - I simply want to understand a little better what happens when not all of the partitions can fit in memory within one executor. Are some spilled to disk while others are operated on in memory? Does Spark reserve memory for each task and, if it detects that there is not enough, schedule the tasks accordingly? Or does it simply run into an out-of-memory error?

Q1 - Does true parallelism within an executor also depend on the amount of memory available on the executor? In other words, I may have 8 cores in my cluster, but if I do not have enough memory to load 8 partitions of my data at once, then I won't be fully parallel.

As a final note, I have seen the following statement several times, but I find it a little confusing:

"Increasing the number of partitions can also help reduce out-of-memory errors, since it means that Spark will operate on a smaller subset of the data for each executor."

How does that work exactly? I mean, Spark may work on a smaller subset, but if the total set of partitions can't fit in memory anyway, what happens?

Recommended answer

Why should I increase the number of tasks (partitions)?

I would like to answer the last question, the one that is confusing you, first. Here is a quote from another question:

Spark does not need to load everything in memory to be able to process it. This is because Spark will partition the data into smaller blocks and operate on these separately.

In fact, by default Spark tries to split the input data automatically into some optimal number of partitions:

Spark automatically sets the number of "map" tasks to run on each file according to its size.
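As a small illustration, assuming an existing SparkContext sc and a purely hypothetical input path, you can inspect the number of input partitions Spark chose and request a higher minimum at read time:

    // Spark derives the default number of partitions from the input size (e.g. HDFS splits).
    val lines = sc.textFile("hdfs:///data/input.txt")
    println(lines.getNumPartitions)                    // size-based default

    // A higher minimum number of partitions can be requested when reading.
    val finer = sc.textFile("hdfs:///data/input.txt", minPartitions = 48)
    println(finer.getNumPartitions)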

One can specify the number of partitions for the operation being performed (for example, for cogroup: def cogroup[W](other: RDD[(K, W)], numPartitions: Int)), and also call .repartition() after any RDD transformation.
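A minimal sketch of both options, using two tiny, made-up pair RDDs:

    // Two small, hypothetical pair RDDs.
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", 3), ("c", 4)))

    // Pass the number of partitions to the shuffle operation itself...
    val grouped = left.cogroup(right, 24)              // 24 = numPartitions

    // ...or repartition after any transformation.
    val reshaped = grouped.repartition(48)
    println(reshaped.getNumPartitions)                 // 48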

Moreover, later in the same paragraph of the documentation they say:

In general, we recommend 2-3 tasks per CPU core in your cluster.

Summing up:

1. The default number of partitions is a good start;
2. 2-3 partitions per CPU core are generally recommended (a short sketch follows below).
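A sketch of applying that rule of thumb to the hypothetical lines RDD from the earlier snippet; on most cluster managers sc.defaultParallelism reflects the total number of executor cores:

    // Rule of thumb: roughly 2-3 partitions per core available to the application.
    val targetPartitions = sc.defaultParallelism * 3
    val retuned = lines.repartition(targetPartitions)
    println(s"target: $targetPartitions, actual: ${retuned.getNumPartitions}")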

How does Spark deal with inputs that do not fit in memory?

In short, by partitioning the input and the intermediate results (RDDs). Usually each small chunk fits in the memory available to the executor and is processed quickly.

Spark is capable of caching the RDDs it has computed. By default, every time an RDD is reused it is recomputed (it is not cached); calling .cache() or .persist() keeps the already computed result in memory or on disk.
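A small sketch of the difference, again based on the hypothetical lines RDD:

    import org.apache.spark.storage.StorageLevel

    // Without caching, `parsed` is recomputed by every action that touches it.
    val parsed = lines.map(_.split(",")).filter(_.length > 1)

    parsed.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
    // parsed.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill what does not fit to disk

    println(parsed.count())                          // first action computes and caches the result
    println(parsed.count())                          // served from the cache, no recomputation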

Internally, each executor has a memory pool that floats between execution and storage (see here for more details). When there is not enough memory for task execution, Spark first tries to evict some of the storage cache and then spills task data to disk. See these slides for further details. The balance between execution and storage memory is well described in this blog post, which also has a nice illustration.
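For reference, that split is controlled by two configuration settings; this sketch just shows where the knobs live, with Spark's default values (which rarely need changing):

    import org.apache.spark.SparkConf

    // Unified memory management knobs (values shown are Spark's defaults).
    val memConf = new SparkConf()
      .set("spark.memory.fraction", "0.6")          // share of (heap - reserved) for execution + storage
      .set("spark.memory.storageFraction", "0.5")   // part of that pool shielded from eviction by execution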

OutOfMemory often happens not directly because of large input data, but because of poor partitioning and therefore large auxiliary data structures, such as the HashMap on reducers (here the documentation again advises having more partitions than executors). So no, OutOfMemory will not happen just because the input is big; processing may, however, be very slow, since it will have to write to and read from disk. They also suggest that tasks as short as 200 ms (in running time) are fine for Spark.
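As a sketch of the same idea at the RDD level, giving the shuffle more partitions keeps each reducer's hash map smaller; the partition count below is an arbitrary placeholder, and parsed is the hypothetical RDD from the caching sketch:

    // Word-count-style aggregation over the hypothetical `parsed` RDD from above.
    // More shuffle partitions mean smaller per-reducer state and less OOM pressure.
    val counts = parsed
      .map(fields => (fields(0), 1))
      .reduceByKey(_ + _, 200)                       // 200 partitions is an arbitrary placeholder
    println(counts.getNumPartitions)                 // 200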

Outline: split your data properly: more than one partition per core, and the running time of each task should be above 200 ms. The default partitioning is a good starting point; tweak the parameters manually.

(I would suggest using a 1/8 subset of the input data on a 1/8-sized cluster to find the optimal number of partitions.)
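A quick (though not free, since it scans the data) way to check how evenly an RDD is spread, using the hypothetical lines RDD again:

    // Count records per partition without collecting the records themselves.
    val sizes = lines.mapPartitions(it => Iterator(it.size)).collect()
    println(s"partitions=${sizes.length}, smallest=${sizes.min}, largest=${sizes.max} records")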

As for whether tasks within one executor affect each other's memory: short answer, they do. For more details, check out the slides mentioned above (starting from slide #32).

All N tasks get an N-th portion of the available memory, and hence affect each other's "parallelism". If I interpret your idea of true parallelism correctly, it is "full utilization of CPU resources". In that case, yes, a small memory pool will result in spilling data to disk, and the computation will become IO-bound instead of CPU-bound.
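A back-of-the-envelope sketch of that sharing. All numbers are assumptions: an 8 GB executor heap, the default spark.memory.fraction of 0.6, roughly 300 MB of reserved memory, and 8 concurrently running tasks. Internally each running task is granted between 1/(2N) and 1/N of the execution pool, so the figure below is only an upper bound:

    // Static, simplified estimate; the real pools are managed dynamically.
    val executorHeapMb  = 8 * 1024                             // assumed executor heap: 8 GB
    val reservedMb      = 300                                  // memory Spark reserves for itself
    val unifiedPoolMb   = (executorHeapMb - reservedMb) * 0.6  // default spark.memory.fraction
    val concurrentTasks = 8                                    // assumed tasks running at once
    val perTaskUpperMb  = unifiedPoolMb / concurrentTasks      // each task gets at most ~1/N of the pool
    println(f"unified pool ~ $unifiedPoolMb%.0f MB, per-task upper bound ~ $perTaskUpperMb%.0f MB")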

I would highly recommend the entire Tuning Spark chapter and the Spark Programming Guide in general. See also the blog post on Spark Memory Management by Alexey Grishchenko.
