How to repartition Spark dataframe depending on row count?

Question

I wrote a simple program that queries a huge database. To export my result, I wrote this function:

result.coalesce(1).write.options(Map("header" -> "true", "delimiter" -> ";")).csv("mycsv.csv")

I use the coalesce method to get only one file as output. The problem is that the result file contains more than one million lines, so I can't open it in Excel...

So, I thought about using a method (or writing my own function with a for loop) that can create partitions based on the number of lines in my file. But I have no idea how to do this.

My idea is that if I have fewer than one million lines, I will have one partition. If I have more than one million, two partitions; more than two million, three partitions; and so on.

Is it possible to do something like this?

Answer

You can change the number of partitions depending on the number of rows in the dataframe.

For example:

val rowsPerPartition = 1000000
// Integer division rounds down, so add 1 to cover any remainder
val partitions = (1 + df.count() / rowsPerPartition).toInt

val df2 = df.repartition(numPartitions = partitions)

Then write the new dataframe to a CSV file as before.
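Putting the two pieces together, a minimal sketch of that write step (reusing df2 from the snippet above, the same header/delimiter options as in the question, and a hypothetical output directory "out_dir"):

// df2 is the repartitioned dataframe computed above;
// "out_dir" is a hypothetical output directory, not a name from the original post.
df2.write
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("out_dir")

Spark writes one part file per partition into out_dir, so each file should contain at most roughly rowsPerPartition rows.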

Note: it may be necessary to use repartition instead of coalesce to make sure the number of rows in each partition is roughly equal; see Spark - repartition() vs coalesce().
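For illustration, a rough sketch of the difference between the two calls (reusing the partitions value computed above):

// coalesce only merges existing partitions without a shuffle, so partition sizes can end up uneven
val viaCoalesce = df.coalesce(partitions)

// repartition performs a full shuffle and spreads rows roughly evenly across partitions
val viaRepartition = df.repartition(partitions)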
