How do we calculate the input data size and feed the number of partitions to re-partition/coalesce?


Problem Description

Example: assume we have an input RDD, InputRDD, that is filtered in a second step. I now want to calculate the data size of the filtered RDD and, given a block size of 128 MB, work out how many partitions it should be repartitioned into.

That would let me pass the right number of partitions to the repartition method.

InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(someFilterCondition)  # placeholder predicate
FilteredRDD.repartition(XX)  # XX = the partition count I want to compute

Q1. How do we calculate the value of XX?

Q2. What is the analogous approach for Spark SQL/DataFrames?

Recommended Answer

The 128 MB block size only comes into the picture when reading data from or writing data to HDFS. Once the RDD is created, the data lives in memory or spills to disk, depending on executor RAM.
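
You can see that initial, HDFS-driven partitioning directly. A minimal PySpark sketch (my addition, reusing the InputRDD from the question):

# The partition count of a freshly loaded textFile RDD reflects the HDFS
# input splits (roughly file size / 128 MB), not anything you computed.
print(InputRDD.getNumPartitions())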

You can't calculate the data size without calling an action such as collect() on the filtered RDD, and that is not recommended.
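
If you still want a rough figure, one workaround (a sketch under my own assumptions, not part of the original answer) is to sum the UTF-8 byte length of every line and derive a partition count from a 128 MB target. This measures raw string bytes, not JVM object overhead, and it costs a full pass over the data:

# Sketch: approximate the filtered RDD's size in bytes, then compute how
# many ~128 MB partitions that implies (ceiling division, minimum 1).
TARGET_BYTES = 128 * 1024 * 1024
total_bytes = FilteredRDD.map(lambda line: len(line.encode("utf-8"))).sum()
XX = max(1, (int(total_bytes) + TARGET_BYTES - 1) // TARGET_BYTES)
FilteredRDD = FilteredRDD.repartition(XX)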

The maximum partition size is 2 GB; choose the number of partitions based on your cluster size or data model.
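
For the cluster-driven choice, a common rule of thumb (my assumption, not stated in the answer) is 2-4 tasks per available core:

# Sketch: size the partition count from cluster parallelism rather than
# data size; sc.defaultParallelism reports the total cores Spark sees.
FilteredRDD = FilteredRDD.repartition(sc.defaultParallelism * 3)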

For Spark SQL/DataFrames (Q2), the equivalent is the DataFrame repartition method:

df.repartition(col)
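
A slightly fuller DataFrame sketch (assumptions: a SparkSession named spark and the same sample.txt; spark.read.text yields a single column named value):

# Sketch: DataFrame repartitioning by count, by column, and the shuffle
# default that Spark SQL uses when you don't specify one.
df = spark.read.text("sample.txt")
df = df.repartition(200)        # explicit partition count
df = df.repartition("value")    # partition by a column
spark.conf.set("spark.sql.shuffle.partitions", "200")  # shuffle default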

