Running out of heap space in sparklyr, but have plenty of memory


Problem description

I am getting heap space errors on even fairly small datasets, and I can be sure that I'm not running out of system memory. For example, consider a dataset containing about 20M rows and 9 columns that takes up 1GB on disk. I am playing with it on a Google Compute node with 30GB of memory.
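
For reference, a data frame of roughly that shape can be simulated as follows. This is only a minimal sketch: the x1..x8 column names and the number of distinct keys are assumptions made for illustration, not taken from the original data.

library(tibble)

# Synthetic stand-in for df: ~20M rows, one key column plus eight numeric columns.
# The number of distinct keys (1e6) is an arbitrary assumption.
n <- 2e7
df <- tibble(
  my_key = sample(1e6, n, replace = TRUE),
  x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n),
  x5 = runif(n), x6 = runif(n), x7 = runif(n), x8 = runif(n)
)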

Let's say that I have this data in a data frame called df. The following works fine, albeit somewhat slowly:

library(tidyverse)

# Distinct values of my_key, computed entirely in local R
uniques <- df %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup()

The following throws java.lang.OutOfMemoryError: Java heap space:

library(tidyverse)
library(sparklyr)
sc <- spark_connect(master = "local")

df_tbl <- copy_to(sc, df)

unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()

I tried this suggestion for increasing Spark's heap space. The problem persists. Watching the machine's state in htop, I see that total memory usage never goes over about 10GB.

library(tidyverse)
library(sparklyr)

config <- spark_config()
config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"

sc <- spark_connect(master = "local", config = config)

df_tbl <- copy_to(sc, df)

unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()

Finally, per Sandeep's comment, I tried lowering MaxHeapSize to 4G. (Is MaxHeapSize per virtual worker or for the entire Spark local instance?) I still got the heap space error, and again, I did not use much of the system's memory.

Recommended answer

In looking into Sandeep's suggestions, I started digging into the sparklyr deployment notes. These mention that the driver can run out of memory at this stage, and that some settings need to be tweaked to correct it.
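
For reference, the kind of settings those notes describe are supplied through spark_config() before connecting. A rough sketch (the 4G/1G values here are illustrative, not tuned recommendations):

library(sparklyr)

# Driver/executor memory, passed to spark-submit as --driver-memory / --executor-memory
config <- spark_config()
config[["sparklyr.shell.driver-memory"]] <- "4G"
config[["sparklyr.shell.executor-memory"]] <- "1G"

sc <- spark_connect(master = "local", config = config)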

These settings did not solve the problem, at least not initially. However, isolating the problem to the collect step led me to similar questions on SO that used SparkR.

Those answers depended in part on setting the environment variable SPARK_MEM. Putting it all together, I got it to work as follows:

library(tidyverse)
library(sparklyr)

# Set memory allocation for whole local Spark instance
Sys.setenv("SPARK_MEM" = "13g")

# Set driver and executor memory allocations
config <- spark_config()
config$spark.driver.memory <- "4G"
config$spark.executor.memory <- "1G"

# Connect to Spark instance
sc <- spark_connect(master = "local", config = config)

# Load data into Spark
df_tbl <- copy_to(sc, df)

# Summarise data
uniques <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
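
As a quick sanity check (assuming a reasonably recent sparklyr), you can inspect the running configuration to confirm the memory settings were actually applied:

# Runtime configuration of the Spark context
spark_context_config(sc)

# Or open the Spark web UI and look at the Environment tab
spark_web(sc)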

