Spark - java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Problem Description

I am attempting a groupBy operation on a dataframe in Cloudera's Spark (2.1.0) on a 7 node cluster with about 512GB of RAM total. My code is as follows.

from pyspark.sql.functions import collect_list

ndf = ndf.repartition(20000)
# Collect all filenames seen for each user into one array column.
by_user_df = ndf.groupBy(ndf.name) \
            .agg(collect_list("file_name")) \
            .withColumnRenamed('collect_list(file_name)', 'file_names')

by_user_df = by_user_df.repartition(20000)
by_user_df.count()

ndf is a dataframe containing 2 columns, a userid and a filename. I am trying to create a list of filenames by userid for passing to CountVectorizer and clustering.
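
For context, the downstream step I have in mind looks roughly like this (a sketch only; it assumes by_user_df from the code above, with file_names as an array-of-strings column):

from pyspark.ml.feature import CountVectorizer

# Turn each user's list of filenames into a sparse count vector
# that can then be fed into a clustering algorithm.
cv = CountVectorizer(inputCol="file_names", outputCol="features")
cv_model = cv.fit(by_user_df)
vectors = cv_model.transform(by_user_df).select("name", "features")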

I get the following error:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

From what I have read, this is due to allocating an array either bigger than what the VM can handle in contiguous memory or larger than a system maximum for array size. Many of the recommendations are to parallelize more by splitting into more partitions.

I have about 6k users and about 7k total filenames. I have noticed that the executor that dies spends the majority of its time in Garbage Collection.

I have tried the following thus far:

  1. Repartitioning the ndf dataframe and the resulting dataframe. I have tried partition arguments as high as 60k for each.
  2. Setting "spark.sql.shuffle.partitions" incrementally up to 20000 (a sketch of how these settings can be applied follows this list).
  3. Increasing executor memory to 25G.
  4. Increasing driver memory to 25G, even though the executor that dies does not appear to be the driver.
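
For reference, this is roughly how I have been applying these settings from code (a sketch only; in practice, executor and driver memory generally need to be supplied when the application is launched, e.g. on the spark-submit command line, rather than changed at runtime):

from pyspark.sql import SparkSession

# Sketch of the configuration tried above. The memory settings only take
# effect if supplied before the JVMs start (e.g. via spark-submit), so
# setting them here is illustrative rather than authoritative.
spark = SparkSession.builder \
    .config("spark.executor.memory", "25g") \
    .config("spark.driver.memory", "25g") \
    .getOrCreate()

# The shuffle partition count can be changed at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "20000")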

As an update to this question: I realized that in this case I am doing a binary clustering over the data, so I really only need one of each filename. Changing collect_list to collect_set left me with the output I needed, and it was apparently small enough to run within the given parameters. I'm still going to try to fix the original case.
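
Concretely, the aggregation that ended up working looks roughly like this (a sketch; I use alias here instead of withColumnRenamed purely for brevity):

from pyspark.sql.functions import collect_set

# collect_set keeps only distinct filenames per user, so the aggregated
# arrays stay much smaller than with collect_list.
by_user_df = ndf.groupBy(ndf.name) \
    .agg(collect_set("file_name").alias("file_names"))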

Recommended Answer

First of all, I don't really understand why you need such a high number of partitions. I don't know how many cores you have on each of the 7 workers, but I doubt you need more than 200 partitions. (The extremely high number of partitions you are using may actually explain why your workers die from garbage collection.)
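
As a rough sanity check (a sketch only, assuming spark is your SparkSession), you could derive the partition count from the parallelism actually available instead of a fixed 20000:

# A few partitions per available core is usually plenty here.
cores = spark.sparkContext.defaultParallelism
ndf = ndf.repartition(cores * 3)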

Your problem looks like a memory problem within the JVM's own limits, so I see no reason to boost driver or worker memory.

I think what you need is to set -Xss, -Xmx, or -XX:MaxPermSize, as explained here: How to fix "Requested array size exceeds VM limit" error in Java?

To do so, you need to use --conf spark.driver.extraJavaOptions and --conf spark.executor.extraJavaOptions when you run Spark.

For example:

--conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M " --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=128M "
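
If you'd rather set the executor side from code instead of the command line, something like the following should also work (a sketch; the driver's JVM options normally still have to be passed via spark-submit, since the driver JVM is already running by the time this code executes):

from pyspark.sql import SparkSession

# Sketch: hand extra JVM options to the executors at session creation.
spark = SparkSession.builder \
    .config("spark.executor.extraJavaOptions", "-Xss10m -XX:MaxPermSize=128M") \
    .getOrCreate()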
