Why does a Spark RDD partition have a 2GB limit for HDFS?


Problem description


I got an error when using MLlib RandomForest to train data. My dataset is huge and the default partitions are relatively small, so an exception was thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:

15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Integer.MAX_VALUE is ~2GB, so it seems that some partitions grew beyond that size. So I repartitioned my RDD into 1000 partitions, so that each partition holds far less data than before. Finally, the problem was solved!

So, my question is: why does the partition size have a 2GB limit? There seems to be no configuration setting for this limit in Spark.

Solution

The basic abstraction for blocks in Spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
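To see where the limit comes from: java.nio.ByteBuffer is sized and indexed by Int, and the memory-mapping call at the top of the stack trace enforces the same ceiling. A small illustration (plain JDK usage, not Spark code):

```scala
import java.nio.ByteBuffer

// A ByteBuffer's capacity is an Int, so a single buffer can never describe
// more than Integer.MAX_VALUE (~2GB) bytes.
val ceiling: Int = Integer.MAX_VALUE
val ok = ByteBuffer.allocate(1024)                // fine: capacity fits in an Int
// ByteBuffer.allocate(3L * 1024 * 1024 * 1024)   // does not compile: Long where an Int is required

// The same ceiling surfaces at runtime: FileChannel.map (the call at the top
// of the stack trace) rejects any region larger than Integer.MAX_VALUE with
// the "Size exceeds Integer.MAX_VALUE" IllegalArgumentException shown above.
println(s"Max bytes addressable by one ByteBuffer: $ceiling")
```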

It is a critical issue that prevents the use of Spark with very large datasets. Increasing the number of partitions can resolve it (as in the OP's case), but that is not always feasible, for instance when there is a large chain of transformations, part of which can increase the data (flatMap etc.), or in cases where the data is skewed.
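For reference, a minimal sketch of the repartition-before-training workaround described above, assuming `data` is an RDD[LabeledPoint] prepared elsewhere; the partition count and training parameters are purely illustrative:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// `data` is assumed to be an RDD[LabeledPoint] loaded elsewhere (hypothetical).
// Repartition so each partition's cached block stays well under the ~2GB ceiling.
val repartitioned: RDD[LabeledPoint] = data.repartition(1000)

// Train as before; these parameter values are illustrative only.
val model = RandomForest.trainClassifier(
  repartitioned,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32,
  seed = 42)
```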

The proposed solution is to come up with an abstraction like LargeByteBuffer that can back a block with a list of ByteBuffers. This impacts the overall Spark architecture, so it has remained unresolved for quite a while.
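To make the idea concrete, here is a conceptual sketch (names and design are illustrative only, not Spark's actual implementation) of how a block could be backed by several ByteBuffers and addressed with a Long offset:

```scala
import java.nio.ByteBuffer

// Conceptual sketch only -- not Spark's API. It illustrates a "LargeByteBuffer"
// style abstraction: a block exposed as a sequence of ByteBuffer chunks, so the
// block as a whole may exceed 2GB while each chunk stays below Integer.MAX_VALUE.
class LargeByteBufferSketch(chunks: Seq[ByteBuffer]) {
  // Total size is tracked as a Long, so it is not capped at ~2GB.
  val size: Long = chunks.map(_.remaining().toLong).sum

  // Read one byte at an absolute Long offset by locating the chunk that owns it.
  def get(pos: Long): Byte = {
    require(pos >= 0 && pos < size, s"position $pos out of range [0, $size)")
    var offset = pos
    var i = 0
    while (offset >= chunks(i).remaining()) {
      offset -= chunks(i).remaining()
      i += 1
    }
    chunks(i).get(chunks(i).position() + offset.toInt)
  }
}
```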
