Spark out of memory

Problem description

I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each).

I'm using Scala to process the files and calculate some aggregate statistics at the end. I see two possible approaches to do that:


  • loop through all the files manually, do the calculations per file and merge the results in the end
  • read the whole folder into one RDD, do all the operations on this single RDD and let Spark handle all the parallelization

I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering whether my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure out with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.

The code I'm using:
- reads TSV files and extracts meaningful data as (String, String, String) triplets
- afterwards, some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
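To make the setup concrete, here is a minimal sketch of such a pipeline, assuming hypothetical input/output paths, column positions and an example aggregate (none of these details are given in the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TsvAggregates {
  def main(args: Array[String]): Unit = {
    // Local mode on the 16-thread workstation described above.
    val conf = new SparkConf().setAppName("tsv-aggregates").setMaster("local[16]")
    val sc = new SparkContext(conf)

    // Approach 2: read the whole folder into a single RDD and let Spark
    // parallelize over the input splits.
    val lines = sc.textFile("/data/tsv")                  // hypothetical input path

    val triplets = lines
      .map(_.split("\t"))
      .filter(_.length >= 3)                              // drop malformed rows
      .map(fields => (fields(0), fields(1), fields(2)))   // assumed column positions

    // Example aggregate: count occurrences of the first field.
    val counts = triplets
      .map { case (key, _, _) => (key, 1L) }
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/data/output")                 // hypothetical output path
    sc.stop()
  }
}
```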

I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a Java heap space error when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).

I guess I will have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing spark.executor.memory and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
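For context, a sketch of what the reduced-core attempt looks like in local mode (the thread count is only an illustration). One possible reason the spark.executor.memory change had no visible effect: in local mode the executors run as threads inside the driver JVM, so the heap that matters is the driver's.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fewer worker threads means fewer tasks in flight at once, so less data is
// held in memory simultaneously. "local[8]" is an illustrative value.
val conf = new SparkConf()
  .setAppName("tsv-aggregates")
  .setMaster("local[8]")   // instead of local[16] / local[*]
val sc = new SparkContext(conf)
```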

I don't need the solution to be very fast (it can easily run for a few hours or even days if needed). I'm also not caching any data, just saving it to the file system at the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.

Answer

My team and I have successfully processed over 1 TB of CSV data on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.


  1. If you repartition an RDD, it requires additional computation that has overhead on top of your heap size. Instead, try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
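A hedged sketch of two ways to raise the read parallelism; the partition count and the 64 MB target split size are illustrative values, not recommendations. The SPLIT_MINSIZE / SPLIT_MAXSIZE constants live on the new-API FileInputFormat (which TextInputFormat extends), so the second variant goes through newAPIHadoopFile:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readWithMoreParallelism(sc: SparkContext, path: String): RDD[String] = {
  // Variant A: simply ask textFile for more partitions up front.
  // val lines = sc.textFile(path, 1200)

  // Variant B: cap the split size for the new-API TextInputFormat so each
  // partition covers at most ~64 MB of input, then read via newAPIHadoopFile.
  sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MAXSIZE, 64L * 1024 * 1024)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    .map { case (_, text) => text.toString }
}
```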

Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (by increasing the number of partitions).
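A possible sketch of the mapPartitions suggestion, reusing the hypothetical (String, String, String) extraction from above; the idea is that any per-record helper state can be allocated once per partition instead of once per record:

```scala
import org.apache.spark.rdd.RDD

// Parse partition-by-partition rather than record-by-record.
def toTriplets(lines: RDD[String]): RDD[(String, String, String)] =
  lines.mapPartitions { iter =>
    val separator = "\t"                 // example of per-partition state
    iter.flatMap { line =>
      val fields = line.split(separator)
      if (fields.length >= 3) Some((fields(0), fields(1), fields(2))) else None
    }
  }

// If memory is still tight, reduce the data per partition by increasing the
// partition count before the heavy stages, e.g. lines.repartition(2000).
```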

Increase the driver memory and executor memory limits using spark.executor.memory and spark.driver.memory in the Spark configuration before creating the SparkContext.
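A sketch of those two settings, with placeholder sizes for the 64 GB workstation. One caveat worth noting: in local/client mode the driver JVM is already running by the time this code executes, so spark.driver.memory is usually supplied via spark-submit's --driver-memory flag (or spark-defaults.conf) rather than through SparkConf:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; tune to the actual workload.
val conf = new SparkConf()
  .setAppName("tsv-aggregates")
  .setMaster("local[16]")
  .set("spark.executor.memory", "4g")   // per-executor heap (cluster deployments)
  .set("spark.driver.memory", "48g")    // see the caveat above for local/client mode
val sc = new SparkContext(conf)
```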

Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
