Why does a Spark RDD partition have a 2GB limit for HDFS?

Problem description

I get an error when using MLlib RandomForest to train data. My dataset is huge and the default partitions are relatively small, so an exception is thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:

15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Integer.MAX_VALUE is about 2GB, so it seems that some partition grew past that size. I repartitioned my RDD into 1000 partitions, so that each partition holds far less data than before. Finally, the problem was solved!
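A minimal sketch of that workaround, assuming a hypothetical LIBSVM input path and illustrative RandomForest parameters (none of these values come from the original post):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionBeforeTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-repartition"))

    // Hypothetical input path; LIBSVM format is only an example.
    val raw = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")

    // Spread the data over many more partitions so that no single
    // cached or shuffled block approaches the 2GB ByteBuffer limit.
    val training = raw.repartition(1000).cache()

    // Illustrative hyperparameters only.
    val model = RandomForest.trainClassifier(
      training,
      2,                 // numClasses
      Map[Int, Int](),   // categoricalFeaturesInfo
      100,               // numTrees
      "auto",            // featureSubsetStrategy
      "gini",            // impurity
      10,                // maxDepth
      32)                // maxBins

    println("Training finished: " + model.toString)
    sc.stop()
  }
}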

So, my question is: why does the partition size have a 2GB limit? There seems to be no Spark configuration setting for this limit.

Recommended answer

The basic abstraction for blocks in Spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
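To see where the exception in the stack trace comes from, here is a small self-contained sketch (plain Java NIO, not Spark code; the scratch-file path is arbitrary) showing that a memory-mapped ByteBuffer cannot cover more than Integer.MAX_VALUE bytes:

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

object ByteBufferLimitDemo {
  def main(args: Array[String]): Unit = {
    // Arbitrary scratch file; it stays empty because the mapping
    // request is rejected before anything is written.
    val file = new RandomAccessFile("/tmp/bytebuffer-limit-demo.bin", "rw")
    try {
      // Ask for a ~3GB mapping. FileChannel.map is specified to throw
      // IllegalArgumentException when size > Integer.MAX_VALUE, which is
      // the same "Size exceeds Integer.MAX_VALUE" error seen above when
      // DiskStore tries to read an oversized block.
      file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, 3L * 1024 * 1024 * 1024)
    } catch {
      case e: IllegalArgumentException => println("Rejected: " + e.getMessage)
    } finally {
      file.close()
    }
  }
}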

It is a critical issue that prevents the use of Spark with very large datasets. Increasing the number of partitions can resolve it (as in the OP's case), but that is not always feasible, for instance when there is a large chain of transformations, part of which can increase the data (flatMap etc.), or in cases where the data is skewed.
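When skew is the suspect, it can help to measure how records are spread across partitions before caching or training. The helper below is only a diagnostic sketch (record counts are a rough proxy for bytes, and the `data` RDD in the usage comment is assumed to exist):

import org.apache.spark.rdd.RDD

object PartitionSizeCheck {
  // Count the records held by each partition in a single pass.
  // A few partitions much larger than the rest indicate skew that a
  // plain repartition() may not fix.
  def partitionSizes[T](rdd: RDD[T]): Array[(Int, Long)] =
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      Iterator((idx, iter.size.toLong))
    }.collect()

  // Usage sketch, assuming an existing RDD called `data`:
  //   PartitionSizeCheck.partitionSizes(data)
  //     .sortBy { case (_, n) => -n }
  //     .take(10)
  //     .foreach { case (idx, n) => println(s"partition $idx: $n records") }
}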

The proposed solution is to come up with an abstraction like LargeByteBuffer, which can back a block with a list of ByteBuffers. This impacts the overall Spark architecture, so it has remained unresolved for quite a while.
