Running out of heap space in sparklyr, but have plenty of memory
Question
I am getting heap space errors on even fairly small datasets. I can be sure that I'm not running out of system memory. For example, consider a dataset containing about 20M rows and 9 columns that takes up 1 GB on disk. I am playing with it on a Google Compute node with 30 GB of memory.
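For reproduction, a stand-in data frame of roughly this shape can be simulated. The sketch below is purely hypothetical: the column names and value distributions are invented, and only the repeated key column matters for the group_by example (the real data has 9 columns).
library(tidyverse)
# Hypothetical stand-in: ~20M rows with a repeated key similar to my_key.
n <- 2e7
df <- tibble(
  my_key = sample(seq_len(1e6), n, replace = TRUE),
  value  = runif(n)
)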
Let's say that I have this data in a dataframe called df. The following works fine, albeit somewhat slowly:
library(tidyverse)
uniques <- df %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup()
The following throws java.lang.OutOfMemoryError: Java heap space.
library(tidyverse)
library(sparklyr)
sc <- spark_connect(master = "local")
df_tbl <- copy_to(sc, df)
unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
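As a diagnostic aside (not part of the original report), the full driver-side stack trace behind the R error can usually be pulled from the local Spark log, and the Spark UI shows which stage fails; a quick sketch, assuming an open connection sc:
# Tail of the local Spark log, where the OutOfMemoryError stack trace lands.
spark_log(sc, n = 50)
# Open the Spark web UI to see which stage is failing.
spark_web(sc)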
I tried this suggestion for increasing the heap space available to Spark. The problem persists. Watching the machine's state in htop, I see that total memory usage never goes over about 10 GB.
library(tidyverse)
library(sparklyr)
config <- spark_config()
config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"
sc <- spark_connect(master = "local", config = config)
df_tbl <- copy_to(sc, df)
unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
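For what it's worth, the same driver heap can also be requested through sparklyr's shell option that maps to spark-submit's --driver-memory flag, rather than via extraJavaOptions. A sketch of that variant, assuming the same 24 GB target:
library(sparklyr)
config <- spark_config()
# Equivalent to passing --driver-memory 24G to spark-submit.
config$`sparklyr.shell.driver-memory` <- "24G"
sc <- spark_connect(master = "local", config = config)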
Finally, per Sandeep's comment, I tried lowering MaxHeapSize to 4G. (Is MaxHeapSize per virtual worker or for the entire Spark local instance?) I still got the heap space error, and again, I did not use much of the system's memory.
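On that question: with master = "local", the driver and executors run inside a single JVM, so one heap setting governs the whole local instance. A small sketch, assuming an open connection sc, to check which maximum heap actually took effect:
# Ask the driver JVM for its maximum heap via java.lang.Runtime.
rt <- invoke_static(sc, "java.lang.Runtime", "getRuntime")
invoke(rt, "maxMemory") / 1024^3   # maximum heap in GiB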
Answer
In looking into Sandeep's suggestions, I started digging into the sparklyr deployment notes. These mention that the driver might run out of memory at this stage, and suggest tweaking some settings to correct it.
These settings did not solve the problem, at least not initially. However, isolating the problem to the collect stage allowed me to find similar problems using SparkR on SO.
These answers depended in part on setting the environment variable SPARK_MEM. Putting it all together, I got it to work as follows:
library(tidyverse)
library(sparklyr)
# Set memory allocation for whole local Spark instance
Sys.setenv("SPARK_MEM" = "13g")
# Set driver and executor memory allocations
config <- spark_config()
config$spark.driver.memory <- "4G"
config$spark.executor.memory <- "1G"
# Connect to Spark instance
sc <- spark_connect(master = "local", config = config)
# Load data into Spark
df_tbl <- copy_to(sc, df)
# Summarise data
uniques <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
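A short follow-up once this runs: the row count is just a sanity check on the collected result, and disconnecting shuts down the local Spark JVM and frees its heap.
# Inspect the collected summary, then release the local Spark instance.
nrow(uniques)
spark_disconnect(sc)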