Spark configuration for Out of memory error


Problem description

Cluster setup -

Driver has 28gb
Workers have 56gb each (8 workers)

Configuration -

spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
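
For context, here is a minimal sketch of applying these settings programmatically via SparkConf. The app name is a placeholder, and note that spark.driver.memory generally has to be supplied at launch time (via spark-submit or spark-defaults.conf), because the driver JVM is already running by the time application code executes:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: mirrors the settings listed above.
val conf = new SparkConf()
  .setAppName("myJob") // placeholder name
  .set("spark.executor.memory", "40g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "16gb")
val sc = new SparkContext(conf)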

My job -

// myFunc just takes a string s and does some transformations on it; the strings are very small, but there are about 10 million of them to process.


//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)

//works fine
data.map(s => myFunc(s))

Also, I de-clustered / removed Spark from my program and it completed just fine (successfully saved to a file) on a single server with 56 GB of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configuration I currently have seems to cover everything that should need changing for my job to work. What else should I be changing?

Update -

Data -

import java.io.{BufferedInputStream, BufferedReader, File, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.{CompressorInputStream, CompressorStreamFactory}
import scala.collection.JavaConverters._

// Read the compressed file on the driver and materialize every line in memory.
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br: BufferedReader = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().iterator().asScala.toArray // java Stream[String] -> Array[String]
val data = sc.parallelize(stringArray)

Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read the file with Spark directly because it throws some EOF errors.
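
If the input happened to be in a codec Hadoop supports (e.g., gzip or bzip2), a direct Spark read would normally look like the sketch below; whether it sidesteps the EOF errors depends on what the file actually is.

// Hedged sketch: textFile decompresses Hadoop-supported codecs transparently, but a
// gzip file is not splittable, so it arrives as one partition and needs repartitioning.
val data = sc.textFile(inputFile).repartition(8)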

As for myFunc, I can't really post its code because it's complex. But basically, the input string is a delimited string; the function does some delimiter replacement, date/time normalization, and things like that. The output string is roughly the same size as the input string.
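
Since the real code isn't posted, here is a purely hypothetical stand-in for myFunc illustrating the kind of transformation described; the '|' input delimiter, the ',' output delimiter, and the date format are all assumptions.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical sketch only: replace the field delimiter and normalize any field
// that parses as a date/time; none of this is the asker's actual code.
def myFunc(s: String): String = {
  val inFmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm")
  val outFmt = DateTimeFormatter.ISO_LOCAL_DATE_TIME
  s.split('|').map { field =>
    try LocalDateTime.parse(field.trim, inFmt).format(outFmt)
    catch { case _: Exception => field } // non-date fields pass through unchanged
  }.mkString(",")
}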

Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as the input data file, as it should be.

Recommended answer

It would help if you put more details of what is going on in your program before and after the map. The second command (map only) does not do anything unless an action is triggered, so the fact that it "works fine" proves little. Your file is probably not partitioned and the driver is doing the work. The code below should force the data to be distributed evenly to the workers and protect against OOM on a single node, although it will cause the data to be shuffled.
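
To make the laziness point concrete, a minimal sketch (reusing data and myFunc from the question):

val mapped = data.map(s => myFunc(s)) // lazy: builds a plan, computes nothing yet
mapped.saveAsTextFile(outFile)        // action: only here does the map actually run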

Updating the solution after looking at your code; it will be better if you do this:

val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)
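
As a side note, sc.parallelize also accepts an explicit slice count, sc.parallelize(stringArray, 8), which spreads the data across partitions up front without the extra shuffle that repartition adds.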
