Spark out of memory

Problem description

I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each).

I'm using Scala to process the files and calculate some aggregate statistics at the end. I see two possible approaches to do that:


  • loop through all the files manually, do the calculations per file and merge the results in the end
  • read the whole folder into one RDD, do all the operations on this single RDD and let Spark handle all the parallelization

I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering whether my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure out with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.

The code I'm using:
- reads TSV files and extracts meaningful data as (String, String, String) triplets
- afterwards, some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
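To make the setup concrete, here is a minimal sketch of such a pipeline, assuming hypothetical input/output paths, column positions and an example aggregate (none of these details are given in the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TsvAggregates {
  def main(args: Array[String]): Unit = {
    // Local mode on the 16-thread workstation described above.
    val conf = new SparkConf().setAppName("tsv-aggregates").setMaster("local[16]")
    val sc = new SparkContext(conf)

    // Approach 2: read the whole folder into a single RDD and let Spark
    // parallelize over the input splits.
    val lines = sc.textFile("/data/tsv")                  // hypothetical input path

    val triplets = lines
      .map(_.split("\t"))
      .filter(_.length >= 3)                              // drop malformed rows
      .map(fields => (fields(0), fields(1), fields(2)))   // assumed column positions

    // Example aggregate: count occurrences of the first field.
    val counts = triplets
      .map { case (key, _, _) => (key, 1L) }
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/data/output")                 // hypothetical output path
    sc.stop()
  }
}
```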

I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a Java heap space error when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).

I guess I will have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing spark.executor.memory and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
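For context, a sketch of what the reduced-core attempt looks like in local mode (the thread count is only an illustration). One possible reason the spark.executor.memory change had no visible effect: in local mode the executors run as threads inside the driver JVM, so the heap that matters is the driver's.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fewer worker threads means fewer tasks in flight at once, so less data is
// held in memory simultaneously. "local[8]" is an illustrative value.
val conf = new SparkConf()
  .setAppName("tsv-aggregates")
  .setMaster("local[8]")   // instead of local[16] / local[*]
val sc = new SparkContext(conf)
```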

I don't need the solution to be very fast (it can easily run for a few hours or even days if needed). I'm also not caching any data, just saving it to the file system at the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.

Answer

My team and I have successfully processed over 1 TB of CSV data on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.


  1. If you repartition an RDD, it requires additional computation that has overhead on top of your heap size. Instead, try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
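A hedged sketch of two ways to raise the read parallelism; the partition count and the 64 MB target split size are illustrative values, not recommendations. The SPLIT_MINSIZE / SPLIT_MAXSIZE constants live on the new-API FileInputFormat (which TextInputFormat extends), so the second variant goes through newAPIHadoopFile:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readWithMoreParallelism(sc: SparkContext, path: String): RDD[String] = {
  // Variant A: simply ask textFile for more partitions up front.
  // val lines = sc.textFile(path, 1200)

  // Variant B: cap the split size for the new-API TextInputFormat so each
  // partition covers at most ~64 MB of input, then read via newAPIHadoopFile.
  sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MAXSIZE, 64L * 1024 * 1024)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    .map { case (_, text) => text.toString }
}
```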

Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (by increasing the number of partitions).
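A possible sketch of the mapPartitions suggestion, reusing the hypothetical (String, String, String) extraction from above; the idea is that any per-record helper state can be allocated once per partition instead of once per record:

```scala
import org.apache.spark.rdd.RDD

// Parse partition-by-partition rather than record-by-record.
def toTriplets(lines: RDD[String]): RDD[(String, String, String)] =
  lines.mapPartitions { iter =>
    val separator = "\t"                 // example of per-partition state
    iter.flatMap { line =>
      val fields = line.split(separator)
      if (fields.length >= 3) Some((fields(0), fields(1), fields(2))) else None
    }
  }

// If memory is still tight, reduce the data per partition by increasing the
// partition count before the heavy stages, e.g. lines.repartition(2000).
```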

Increase the driver memory and executor memory limits using spark.executor.memory and spark.driver.memory in the Spark configuration before creating the SparkContext.
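A sketch of those two settings, with placeholder sizes for the 64 GB workstation. One caveat worth noting: in local/client mode the driver JVM is already running by the time this code executes, so spark.driver.memory is usually supplied via spark-submit's --driver-memory flag (or spark-defaults.conf) rather than through SparkConf:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; tune to the actual workload.
val conf = new SparkConf()
  .setAppName("tsv-aggregates")
  .setMaster("local[16]")
  .set("spark.executor.memory", "4g")   // per-executor heap (cluster deployments)
  .set("spark.driver.memory", "48g")    // see the caveat above for local/client mode
val sc = new SparkContext(conf)
```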

Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
