Running out of heap space in sparklyr, but have plenty of memory
Question
I am getting heap space errors on even fairly small datasets. I can be sure that I'm not running out of system memory. For example, consider a dataset containing about 20M rows and 9 columns that takes up 1 GB on disk. I am playing with it on a Google Compute node with 30 GB of memory.
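For reproduction, a stand-in data frame of roughly this shape can be simulated. The sketch below is purely hypothetical: the column names and value distributions are invented, and only the repeated key column matters for the group_by example (the real data has 9 columns).
library(tidyverse)
# Hypothetical stand-in: ~20M rows with a repeated key similar to my_key.
n <- 2e7
df <- tibble(
  my_key = sample(seq_len(1e6), n, replace = TRUE),
  value  = runif(n)
)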
Let's say that I have this data in a dataframe called df. The following works fine, albeit somewhat slowly:
library(tidyverse)
uniques <- df %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup()
The following throws java.lang.OutOfMemoryError: Java heap space.
library(tidyverse)
library(sparklyr)
sc <- spark_connect(master = "local")
df_tbl <- copy_to(sc, df)
unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
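As a diagnostic aside (not part of the original report), the full driver-side stack trace behind the R error can usually be pulled from the local Spark log, and the Spark UI shows which stage fails; a quick sketch, assuming an open connection sc:
# Tail of the local Spark log, where the OutOfMemoryError stack trace lands.
spark_log(sc, n = 50)
# Open the Spark web UI to see which stage is failing.
spark_web(sc)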
I tried this suggestion for increasing the heap space available to Spark. The problem persists. Watching the machine's state in htop, I see that total memory usage never goes over about 10 GB.
library(tidyverse)
library(sparklyr)
config <- spark_config()
config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"
sc <- spark_connect(master = "local", config = config)
df_tbl <- copy_to(sc, df)
unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
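For what it's worth, the same driver heap can also be requested through sparklyr's shell option that maps to spark-submit's --driver-memory flag, rather than via extraJavaOptions. A sketch of that variant, assuming the same 24 GB target:
library(sparklyr)
config <- spark_config()
# Equivalent to passing --driver-memory 24G to spark-submit.
config$`sparklyr.shell.driver-memory` <- "24G"
sc <- spark_connect(master = "local", config = config)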
Finally, per Sandeep's comment, I tried lowering MaxHeapSize to 4G. (Is MaxHeapSize per virtual worker or for the entire Spark local instance?) I still got the heap space error, and again, I did not use much of the system's memory.
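On that question: with master = "local", the driver and executors run inside a single JVM, so one heap setting governs the whole local instance. A small sketch, assuming an open connection sc, to check which maximum heap actually took effect:
# Ask the driver JVM for its maximum heap via java.lang.Runtime.
rt <- invoke_static(sc, "java.lang.Runtime", "getRuntime")
invoke(rt, "maxMemory") / 1024^3   # maximum heap in GiB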
Answer
In looking into Sandeep's suggestions, I started digging into the sparklyr deployment notes. These mention that the driver might run out of memory at this stage, and suggest tweaking some settings to correct it.
These settings did not solve the problem, at least not initially. However, isolating the problem to the collect stage allowed me to find similar problems using SparkR on SO.
These answers depended in part on setting the environment variable SPARK_MEM. Putting it all together, I got it to work as follows:
library(tidyverse)
library(sparklyr)
# Set memory allocation for whole local Spark instance
Sys.setenv("SPARK_MEM" = "13g")
# Set driver and executor memory allocations
config <- spark_config()
config$spark.driver.memory <- "4G"
config$spark.executor.memory <- "1G"
# Connect to Spark instance
sc <- spark_connect(master = "local", config = config)
# Load data into Spark
df_tbl <- copy_to(sc, df)
# Summarise data
uniques <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
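A short follow-up once this runs: the row count is just a sanity check on the collected result, and disconnecting shuts down the local Spark JVM and frees its heap.
# Inspect the collected summary, then release the local Spark instance.
nrow(uniques)
spark_disconnect(sc)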