How do we calculate the input data size and feed the number of partitions to re-partition/coalesce?


Problem Description

Example: Assume we have an input RDD that is filtered in a second step. I now want to estimate the data size of the filtered RDD and, taking the block size to be 128MB, calculate how many partitions will be needed for repartitioning.

This will help me pass the number of partitions to the repartition method.

InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(lambda line: "ERROR" in line)  # placeholder for some filter condition
FilteredRDD.repartition(XX)

Q1. How do we calculate the value of XX?

Q2. What is the analogous approach for Spark SQL/DataFrames?

Recommended Answer

The 128MB block size comes into play only when reading data from or writing data to HDFS. Once the RDD is created, the data is held in memory or spilled to disk depending on executor RAM size.
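For example, the initial partition count of an RDD read from HDFS mirrors the input splits, roughly the file size divided by the 128MB block size, and can be inspected directly; a minimal sketch:

InputRDD = sc.textFile("sample.txt")
# Initial partitions mirror the HDFS input splits
# (about file_size / 128MB with the default block size)
print(InputRDD.getNumPartitions())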

You can't calculate the data size without calling an action such as collect() on the filtered RDD, and doing so is not recommended.
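If a rough figure is still needed without collecting the whole RDD, one workaround is to sample a few records and extrapolate. A minimal sketch, assuming the RDD holds text lines (the 1000-record sample size is arbitrary):

import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # aim for roughly 128MB per partition

# Estimate the average record size from a small sample, then scale by the
# total record count. take() and count() are actions themselves, so this
# estimate is not free.
sample = FilteredRDD.take(1000)
avg_bytes = sum(len(line.encode("utf-8")) for line in sample) / max(len(sample), 1)
estimated_total_bytes = avg_bytes * FilteredRDD.count()

XX = max(1, math.ceil(estimated_total_bytes / TARGET_PARTITION_BYTES))
FilteredRDD.repartition(XX)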

The maximum partition size is 2GB. You can choose the number of partitions based on the cluster size or the data model.
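In practice, a common rule of thumb is to aim for roughly 2-4 tasks per available core. A minimal sketch using sc.defaultParallelism as a proxy for cluster capacity (the factor of 3 is illustrative, not canonical):

# defaultParallelism typically reflects the total cores available to the application
XX = sc.defaultParallelism * 3
FilteredRDD.repartition(XX)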

 df.repartition(col)
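For Spark SQL/DataFrames, repartition likewise accepts an explicit count, one or more columns, or both. A short sketch (the file name and filter condition are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("sample.txt")
filtered = df.filter(col("value").contains("ERROR"))  # placeholder condition

filtered.repartition(200)                # explicit partition count
filtered.repartition(200, col("value"))  # count plus a partitioning column

Note that DataFrame shuffles (joins, aggregations) otherwise fall back to spark.sql.shuffle.partitions, which defaults to 200 and can be tuned instead of calling repartition explicitly.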

