Why does a Spark RDD partition have a 2GB limit for HDFS?

Problem description

I get an error when using MLlib RandomForest to train data. My dataset is huge and the default partitions are relatively small, so an exception is thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:

15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Integer.MAX_VALUE is about 2GB, so it seems that some partition grew past that size. I repartitioned my RDD into 1000 partitions, so that each partition holds far less data than before. Finally, the problem was solved!
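A minimal sketch of that workaround, assuming a hypothetical LIBSVM input path and illustrative RandomForest parameters (none of these values come from the original post):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionBeforeTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-repartition"))

    // Hypothetical input path; LIBSVM format is only an example.
    val raw = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")

    // Spread the data over many more partitions so that no single
    // cached or shuffled block approaches the 2GB ByteBuffer limit.
    val training = raw.repartition(1000).cache()

    // Illustrative hyperparameters only.
    val model = RandomForest.trainClassifier(
      training,
      2,                 // numClasses
      Map[Int, Int](),   // categoricalFeaturesInfo
      100,               // numTrees
      "auto",            // featureSubsetStrategy
      "gini",            // impurity
      10,                // maxDepth
      32)                // maxBins

    println("Training finished: " + model.toString)
    sc.stop()
  }
}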

So, my question is: why does the partition size have a 2GB limit? There seems to be no Spark configuration setting for this limit.

Recommended answer

The basic abstraction for blocks in Spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
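To see where the exception in the stack trace comes from, here is a small self-contained sketch (plain Java NIO, not Spark code; the scratch-file path is arbitrary) showing that a memory-mapped ByteBuffer cannot cover more than Integer.MAX_VALUE bytes:

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

object ByteBufferLimitDemo {
  def main(args: Array[String]): Unit = {
    // Arbitrary scratch file; it stays empty because the mapping
    // request is rejected before anything is written.
    val file = new RandomAccessFile("/tmp/bytebuffer-limit-demo.bin", "rw")
    try {
      // Ask for a ~3GB mapping. FileChannel.map is specified to throw
      // IllegalArgumentException when size > Integer.MAX_VALUE, which is
      // the same "Size exceeds Integer.MAX_VALUE" error seen above when
      // DiskStore tries to read an oversized block.
      file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, 3L * 1024 * 1024 * 1024)
    } catch {
      case e: IllegalArgumentException => println("Rejected: " + e.getMessage)
    } finally {
      file.close()
    }
  }
}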

It is a critical issue that prevents the use of Spark with very large datasets. Increasing the number of partitions can resolve it (as in the OP's case), but that is not always feasible, for instance when there is a large chain of transformations, part of which can increase the data (flatMap etc.), or in cases where the data is skewed.
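When skew is the suspect, it can help to measure how records are spread across partitions before caching or training. The helper below is only a diagnostic sketch (record counts are a rough proxy for bytes, and the `data` RDD in the usage comment is assumed to exist):

import org.apache.spark.rdd.RDD

object PartitionSizeCheck {
  // Count the records held by each partition in a single pass.
  // A few partitions much larger than the rest indicate skew that a
  // plain repartition() may not fix.
  def partitionSizes[T](rdd: RDD[T]): Array[(Int, Long)] =
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      Iterator((idx, iter.size.toLong))
    }.collect()

  // Usage sketch, assuming an existing RDD called `data`:
  //   PartitionSizeCheck.partitionSizes(data)
  //     .sortBy { case (_, n) => -n }
  //     .take(10)
  //     .foreach { case (idx, n) => println(s"partition $idx: $n records") }
}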

The proposed solution is to come up with an abstraction like LargeByteBuffer, which can back a block with a list of ByteBuffers. This impacts the overall Spark architecture, so it has remained unresolved for quite a while.
